
Your ICLR Recommendation list

There are 1000 papers for you in ICLR 2025


1. Distribution Backtracking Builds A Faster Convergence Trajectory for Diffusion Distillation

[openreview] [pdf]

Abstract Accelerating the sampling speed of diffusion models remains a significant challenge. Recent score distillation methods distill a heavy teacher model into a student generator to achieve one-step generation, which is optimized by calculating the difference between two score functions on the samples generated by the student model. However, there is a score mismatch issue in the early stage of the score distillation process, since existing methods mainly focus on using the endpoint of pre-trained diffusion models as teacher models, overlooking the importance of the convergence trajectory between the student generator and the teacher model. To address this issue, we extend the score distillation process by introducing the entire convergence trajectory of the teacher model and propose Distribution Backtracking Distillation (DisBack). DisBack is composed of two stages: Degradation Recording and Distribution Backtracking. Degradation Recording is designed to obtain the convergence trajectory by recording the degradation path from the pre-trained teacher model to the untrained student generator. The degradation path implicitly represents the intermediate distributions between the teacher and the student, and its reverse can be viewed as the convergence trajectory from the student generator to the teacher model. Distribution Backtracking then trains the student generator to backtrack the intermediate distributions along the path to approximate the convergence trajectory of the teacher model. Extensive experiments show that DisBack achieves faster and better convergence than existing distillation methods and achieves comparable or better generation performance, with an FID score of 1.38 on the ImageNet 64×64 dataset. DisBack is easy to implement and can be generalized to existing distillation methods to boost performance.
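
The two-stage recipe is concrete enough to sketch in code. Below is a minimal PyTorch-style sketch, assuming toy score networks and a plain MSE matching loss in place of the paper's actual score-distillation objectives; the function names are ours, not the authors' API.

```python
# Hypothetical sketch of DisBack's two stages on toy networks.
import copy
import torch
import torch.nn as nn

def degradation_record(teacher, student_init, data, steps=100, snap_every=20):
    """Stage 1: fine-tune a copy of the teacher toward the untrained student's
    outputs, saving checkpoints along the way (the degradation path)."""
    path = [copy.deepcopy(teacher.state_dict())]
    model = copy.deepcopy(teacher)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for step in range(steps):
        x = data[torch.randint(len(data), (32,))]
        loss = (model(x) - student_init(x).detach()).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        if (step + 1) % snap_every == 0:
            path.append(copy.deepcopy(model.state_dict()))
    return path[::-1]  # reversed: from student-like checkpoints back to the teacher

def backtrack_distill(student, trajectory, teacher_arch, data, steps_per_target=50):
    """Stage 2: distill the student against each intermediate target in turn."""
    target = copy.deepcopy(teacher_arch)
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for state in trajectory:
        target.load_state_dict(state)
        for _ in range(steps_per_target):
            x = data[torch.randint(len(data), (32,))]
            loss = (student(x) - target(x).detach()).pow(2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return student

# toy usage with 2-D "scores"
data = torch.randn(1024, 2)
teacher = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
student = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
traj = degradation_record(teacher, student, data)
student = backtrack_distill(student, traj, teacher, data)
```

The key structural point is the reversal: checkpoints are recorded while degrading the teacher toward the untrained student, then replayed in reverse order as successive distillation targets.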

2. Diffusion Transformer Policy

[openreview] [pdf]

Abstract Recent large visual-language action models pretrained on diverse robot datasets have demonstrated the potential to generalize to new environments with only a small amount of in-domain data. However, those approaches usually predict discretized or continuous actions with a small action head, which limits their ability to handle diverse action spaces. In contrast, we model continuous actions with a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, in which we directly denoise action chunks with a large transformer model rather than a small action head. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large, diverse robot datasets and achieve better generalization performance. Extensive experiments demonstrate that Diffusion Transformer Policy pre-trained on diverse robot data can generalize to different embodiments, including simulation environments like Maniskill2 and Calvin, as well as a real-world Franka arm. Specifically, without bells and whistles, the proposed approach achieves state-of-the-art performance in the Calvin novel task setting, and the pre-training stage improves the success sequence length on Calvin by over 1.2. The code will be publicly available.

3. Diffusion Policy Policy Optimization

[openreview] [pdf]

Abstract We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG methods are ubiquitous in training RL policies with other policy parameterizations; nevertheless, they have been conjectured to be less efficient for diffusion-based policies. Surprisingly, we show that DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks compared to other RL methods for diffusion-based policies and also compared to PG fine-tuning of other policy parameterizations. Through experimental investigation, we find that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization, leading to structured and on-manifold exploration, stable training, and strong policy robustness. We further demonstrate the strengths of DPPO in a range of realistic settings, including simulated robotic tasks with pixel observations, and via zero-shot deployment of simulation-trained policies on robot hardware in a long-horizon, multi-stage manipulation task.

4. Diffusion Models for 4D Novel View Synthesis

[openreview] [pdf]

Abstract We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS), supporting generation with arbitrary camera trajectories and timestamps, in natural scenes, conditioned on one or more images. With a novel architecture and sampling procedure, we enable training on a mixture of 3D (with camera pose), 4D (pose+time) and video (time but no pose) data, which greatly improves generalization to unseen images and camera pose trajectories over prior works which generally operate in limited domains (e.g., object centric). 4DiM is the first-ever NVS method with intuitive metric-scale camera pose control enabled by our novel calibration pipeline for structure-from-motion-posed data. Experiments demonstrate that 4DiM outperforms prior 3D NVS models both in terms of image fidelity and pose alignment, while also enabling the generation of scene dynamics. 4DiM provides a general framework for a variety of tasks including single-image-to-3D, two-image-to-video (interpolation and extrapolation), and pose-conditioned video-to-video translation, which we illustrate qualitatively on a variety of scenes. See https://anonymous-4d-diffusion.github.io for video samples.

5. The Deficit of New Information in Diffusion Models: A Focus on Diverse Samples

[openreview] [pdf]

Abstract Diffusion models are renowned for their state-of-the-art performance in generating high-quality images. Identifying samples with new information beyond the training data is essential for data augmentation, especially for enhancing model performance in diverse and unforeseen real-world scenarios. However, the presence of new information in generated samples has not been well explored. Our investigation through the lens of information theory reveals that diffusion models do not produce new information beyond what exists in the training data. We then introduce the concept of diverse samples (DS) to show that generated images can contain information not present in the training data for diffusion models. Furthermore, we propose a method for identifying diverse samples among generated images by extracting deep features and detecting images that fall outside the boundary of real images. We demonstrate that diverse samples exist in the generated data of diffusion models, attributed to the estimation of the forward and backward processes, but that diffusion models can only produce a limited number of diverse samples, underscoring a notable gap in their capability to generate diverse samples. In addition, our experiment on the Chest X-ray dataset demonstrates that diverse samples are more useful in improving classification accuracy than vanilla-generated samples. The source code is available at https://github.com/lypz12024/diffusion-diverse-samples.
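
The detection rule lends itself to a compact illustration. The sketch below is our stand-in, with random vectors in place of deep features and a max kNN-distance threshold as the "boundary" of real images; the paper's actual boundary construction may differ.

```python
# Illustrative diverse-sample detector: generated images whose features fall
# outside the real-feature boundary count as diverse. All choices here
# (features, threshold rule) are assumptions for demonstration.
import numpy as np

def knn_dist(x, ref, k=5, exclude_self=False):
    d = np.linalg.norm(x[:, None, :] - ref[None, :, :], axis=-1)
    if exclude_self:
        np.fill_diagonal(d, np.inf)  # don't count a point as its own neighbor
    return np.sort(d, axis=1)[:, :k].mean(axis=1)

rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 64))            # features of real images
gen_feats = rng.normal(0.2, 1.1, size=(200, 64))   # features of generated images

threshold = knn_dist(real_feats, real_feats, exclude_self=True).max()
is_diverse = knn_dist(gen_feats, real_feats) > threshold
print(f"diverse fraction: {is_diverse.mean():.3f}")
```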

6. What Makes a Good Diffusion Planner for Decision Making?

[openreview] [pdf]

Abstract Diffusion models have recently shown significant potential in solving decision-making problems, particularly in generating behavior plans -- also known as diffusion planning. While numerous studies have demonstrated the impressive performance of diffusion planning, the mechanisms behind the key components of a good diffusion planner remain unclear, and the design choices in existing studies are highly inconsistent. In this work, we address this issue through systematic empirical experiments on diffusion planning in an offline reinforcement learning (RL) setting, providing practical insights into the essential components of diffusion planning. We trained and evaluated over 6,000 diffusion models, identifying critical components such as guided sampling, network architecture, action generation, and planning strategy. We reveal that some design choices opposite to common practice in previous diffusion-planning work actually lead to better performance; for example, unconditional sampling with selection can be better than guided sampling, and a Transformer outperforms a U-Net as the denoising network. Based on these insights, we suggest a simple yet strong diffusion planning baseline that achieves state-of-the-art results on standard offline RL benchmarks.
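
The "unconditional sampling with selection" finding is easy to render schematically. In the hypothetical sketch below, `sample_plan` and `value_fn` are placeholder callables standing in for an unconditional diffusion planner and a learned return estimator.

```python
# Sample-then-select planning: draw several plans unconditionally and keep
# the one a value estimate scores highest. Interfaces are illustrative.
import torch

def select_plan(sample_plan, value_fn, n_candidates=64):
    plans = torch.stack([sample_plan() for _ in range(n_candidates)])
    scores = value_fn(plans)                # e.g., a learned return predictor
    return plans[scores.argmax()]

# toy usage: 10-step plans over a 4-dim action space, scored by mean action
best = select_plan(lambda: torch.randn(10, 4), lambda p: p.mean(dim=(1, 2)))
```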

7. Diffusion-NPO: Negative Preference Optimization for Better Preference Aligned Generation of Diffusion Models

[openreview] [pdf]

Abstract Diffusion models have made substantial advances in image generation, yet models trained on large, unfiltered datasets often yield outputs misaligned with human preferences. Numerous methods have been proposed to fine-tune pre-trained diffusion models, achieving notable improvements in aligning generated outputs with human preferences. However, we point out that existing preference alignment methods neglect the critical role of handling unconditional/negative-conditional outputs, leading to a diminished capacity to avoid generating undesirable outcomes. This oversight limits the efficacy of classifier-free guidance (CFG), which relies on the contrast between conditional generation and unconditional/negative-conditional generation to optimize output quality. In response, we propose a straightforward but versatile and effective approach that involves training a model specifically attuned to negative preferences. This method does not require new training strategies or datasets but rather involves minor modifications to existing techniques. Our approach integrates seamlessly with models such as SD15, SDXL, video diffusion models, and models that have undergone preference optimization, consistently enhancing their ability to produce outputs better aligned with human preferences.

8. Discovery and Expansion of New Domains within Diffusion Models

[openreview] [pdf]

Abstract In this work, we study the generalization properties of diffusion models in a few-shot setup, introduce a novel tuning-free paradigm to synthesize the target out-of-domain (OOD) data, showcase multiple applications of those generalization properties, and demonstrate the advantages compared to existing tuning-based methods in data-sparse scientific scenarios with large domain gaps. Our work rests on the observation and premise that the theoretical formulation of denoising diffusion implicit models (DDIMs), a non-Markovian inference technique, exhibits latent Gaussian priors independent from the parameters of trained denoising diffusion probabilistic models (DDPMs). This brings two practical benefits: the latent Gaussian priors generalize to OOD data domains that have never been used in the training stage; existing DDIMs offer the flexibility to traverse the denoising chain bidirectionally for a pre-trained DDPM. We then demonstrate through theoretical and empirical studies that such established OOD Gaussian priors are practically separable from the originally trained ones after inversion. The above analytical findings allow us to introduce our novel tuning-free paradigm to synthesize new images of the target unseen domain by discovering qualified OOD latent encodings within the inverted noisy latent spaces, which is fundamentally different from most existing paradigms that seek to modify the denoising trajectory to achieve the same goal by tuning the model parameters. Extensive cross-model and domain experiments show that our proposed method can expand the latent space and synthesize images in new domains via frozen DDPMs without impairing the generation quality of their original domains.

9. Iterative DPO with An Improvement Model for Fine-tuning Diffusion Models

[openreview] [pdf]

Abstract Direct Preference Optimization (DPO) has been proven as an effective solution in aligning generative models with human preferences. However, as shown in recent works, DPO could suffer from constraints from the offline preference dataset. This paper introduces a novel improvement approach for online iterative optimization of the diffusion models without introducing extra annotation of the online data. We propose to learn a preference improvement model to extract the implicit preference from the preference dataset. The learned improvement model is then used to generate winning images from the images generated by the current diffusion model. We can construct new pairs of preference data by using images generated by the current diffusion model as losing images, and its corresponding improved images as winning images. The diffusion model can therefore be optimized via iteratively applying online preference datasets. This method enables online improvement beyond offline DPO training without requiring additional human labeling or risking overfitting the reward model. Results demonstrate improvements in preference alignment with higher diversity compared with other fine-tuning methods. Our work bridges the gap between offline preference learning and online improvement, offering a promising direction for enhancing diffusion models in image generation tasks with limited preference data.

10. Adding Conditional Control to Diffusion Models with Reinforcement Learning

[openreview] [pdf]

Abstract Diffusion models are powerful generative models that allow for precise control over the characteristics of the generated samples. While these diffusion models trained on large datasets have achieved success, there is often a need to introduce additional controls in downstream fine-tuning processes, treating these powerful models as pre-trained diffusion models. This work presents a novel method based on reinforcement learning (RL) to add such controls using an offline dataset comprising inputs and labels. We formulate this task as an RL problem, with the classifier learned from the offline dataset and the KL divergence against pre-trained models serving as the reward functions. Our method, CTRL (Conditioning pre-Trained diffusion models with Reinforcement Learning), produces soft-optimal policies that maximize the abovementioned reward functions. We formally demonstrate that our method enables sampling from the conditional distribution with additional controls during inference. Our RL-based approach offers several advantages over existing methods. Compared to classifier-free guidance, it improves sample efficiency and can greatly simplify dataset construction by leveraging conditional independence between the inputs and additional controls. Additionally, unlike classifier guidance, it eliminates the need to train classifiers from intermediate states to additional controls.
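
As a rough illustration of the reward structure described (our assumed form, not the paper's exact objective), the reward combines a classifier log-likelihood for the added control with a KL penalty toward the pre-trained model:

```python
# Toy rendering of a classifier-plus-KL reward; the per-sample KL is
# approximated by a log-density difference, and all callables are placeholders.
import torch

def ctrl_reward(x, y, classifier, logp_finetuned, logp_pretrained, beta=0.1):
    log_lik = classifier(x).log_softmax(-1).gather(-1, y[:, None]).squeeze(-1)
    kl_proxy = logp_finetuned(x) - logp_pretrained(x)  # per-sample KL estimate
    return log_lik - beta * kl_proxy

# toy usage with a linear classifier and Gaussian-like log-densities
cls = torch.nn.Linear(8, 3)
x, y = torch.randn(4, 8), torch.tensor([0, 1, 2, 0])
r = ctrl_reward(x, y, cls,
                lambda x: -x.pow(2).sum(-1),
                lambda x: -(x - 0.1).pow(2).sum(-1))
```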

11. Diffusion Models Meet Contextual Bandits

[openreview] [pdf]

Abstract Efficient exploration in contextual bandits is crucial due to their large action space, where uninformed exploration can lead to computational and statistical inefficiencies. However, the rewards of actions are often correlated, which can be leveraged for more efficient exploration. In this work, we use pre-trained diffusion model priors to capture these correlations and develop diffusion Thompson sampling (dTS). We establish both theoretical and algorithmic foundations for dTS. Specifically, we derive efficient posterior approximations (required by dTS) under a diffusion model prior, which are of independent interest beyond bandits and reinforcement learning. We analyze dTS in linear instances and provide a Bayes regret bound. Our experiments validate our theory and demonstrate dTS’s favorable performance.

12. Influence-Guided Diffusion for Dataset Distillation

[openreview] [pdf]

Abstract Dataset distillation aims to streamline the training process by creating a compact yet effective dataset from a much larger original dataset. However, existing methods often struggle with distilling large, high-resolution datasets due to prohibitive resource costs and limited performance, primarily stemming from sample-wise optimization in pixel space. Motivated by the remarkable capabilities of diffusion generative models in learning target dataset distributions and controllably sampling high-quality data tailored to user needs, we propose framing dataset distillation as a controlled diffusion generation task that generates data specifically tailored for effective training. By establishing a correlation between the overarching objective of dataset distillation and the trajectory influence function, we introduce the Influence-Guided Diffusion (IGD) sampling framework, which generates training-effective data without the need to retrain diffusion models. An efficient guidance function is designed by leveraging the trajectory influence function as an indicator to steer diffusions to produce data with influence promotion and diversity enhancement. Extensive experiments show that the training performance of distilled datasets generated by diffusions can be significantly improved by integrating our IGD method, achieving state-of-the-art performance in distilling ImageNet datasets. In particular, an exceptional result is achieved on ImageNet-1K, reaching 60.3% at IPC=50.

13. Inverse Engineering Diffusion: Deriving Variance Schedules with Rationale

[openreview] [pdf]

Abstract A fundamental aspect of diffusion models is the variance schedule, which governs the evolution of variance throughout the diffusion process. Despite numerous studies exploring variance schedules, little effort has been made to understand the variance distributions implied by sampling from these schedules and how they benefit both training and data generation. We introduce a novel perspective on score-based diffusion models, bridging the gap between the variance schedule and its underlying variance distribution. Specifically, we propose the notion of sampling variance according to a probabilistic rationale, which induces a density. Our approach views the inverse of the variance schedule as a cumulative distribution function (CDF) and its first derivative as a probability density function (PDF) of the variance distribution. This formulation not only offers a unified view of variance schedules but also allows for the direct engineering of a variance schedule from the probabilistic rationale of its inverse function. Additionally, our framework is not limited to CDFs with closed-form inverse solutions, enabling the exploration of variance schedules that are unattainable through conventional methods. We present the tools required to obtain a diverse array of novel variance schedules tailored to specific rationales, such as separability metrics or prior beliefs. These schedules may exhibit varied dynamics, ranging from rapid convergence towards zero to prolonged periods in high-variance regions. Through comprehensive empirical evaluation, we demonstrate the efficacy of enhancing the performance of diffusion models with schedules distinct from those encountered during training. We provide a principled and unified approach to variance schedules in diffusion models, revealing the relationship between variance schedules and their underlying probabilistic rationales, which yields notable improvements in image generation performance, as measured by FID.
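
A small worked example makes the CDF view concrete. Our toy choice of rationale: take the variance density p(v) = a·v^(a-1) on [0, 1], whose CDF is F(v) = v^a; the schedule is then the inverse CDF, v(t) = t^(1/a), so sampling t uniformly yields variances distributed according to p.

```python
# Worked instance of "inverse of the schedule = CDF of the variance
# distribution", with an assumed toy density p(v) = a * v**(a-1).
import numpy as np

a = 3.0                        # a > 1 puts more mass in high-variance regions
t = np.random.rand(100_000)    # uniform timesteps, as in training
v = t ** (1.0 / a)             # variance schedule = inverse CDF

# check: the histogram of sampled variances matches p(v) = a * v**(a-1)
hist, edges = np.histogram(v, bins=50, range=(0.0, 1.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.abs(hist - a * centers ** (a - 1)).max())   # close to 0
```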

14. Progressive distillation induces an implicit curriculum

[openreview] [pdf]

Abstract Knowledge distillation leverages a teacher model to improve the training of a student model. A persistent challenge is that a better teacher does not always yield a better student, to which a common mitigation is to use additional supervision from several “intermediate” teachers. One empirically validated variant of this principle is progressive distillation, where the student learns from successive intermediate checkpoints of the teacher. Using sparse parity as a sandbox, we identify an implicit curriculum as one mechanism through which progressive distillation accelerates the student’s learning. This curriculum is available only through the intermediate checkpoints but not the final converged one, and imparts both empirical acceleration and a provable sample complexity benefit to the student. We then extend our investigation to Transformers trained on probabilistic context-free grammars (PCFGs) and real-world pre-training datasets (Wikipedia and Books). Through probing the teacher model, we identify an analogous implicit curriculum where the model progressively learns features that capture longer context. Our theoretical and empirical findings on sparse parity, complemented by empirical observations on more complex tasks, highlight the benefit of progressive distillation via implicit curriculum across setups.

15. One Step Diffusion via Shortcut Models

[openreview] [pdf]

Abstract Diffusion models and flow matching models have enabled generating diverse and realistic images by learning to transfer noise to data. However, sampling from these models involves iterative denoising over many neural network passes, making generation slow and expensive. Previous approaches for speeding up sampling require complex training regimes, such as multiple training phases, multiple networks, or fragile scheduling. We introduce Shortcut Models, a family of generative models that use a single network and training phase to produce high-quality samples in a single or multiple sampling steps. Shortcut models condition the network not only on the current noise level but also on the desired step size, allowing the model to skip ahead in the generation process. Across a wide range of sampling step budgets, shortcut models consistently produce higher quality samples than previous approaches, such as consistency models and reflow. Compared to distillation, shortcut models reduce complexity to a single network and training phase and additionally allow varying step budgets at inference time.
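
The mechanism reduces to a single extra conditioning input. The sketch below assumes an interface `net(x, t, dt)` for the trained shortcut network; with it, the same model serves any step budget.

```python
# Minimal shortcut-style sampling loop under an assumed network interface:
# one network, conditioned on noise level t and desired step size dt.
import torch

@torch.no_grad()
def shortcut_sample(net, shape, n_steps):
    x = torch.randn(shape)                 # start from pure noise
    dt = 1.0 / n_steps                     # desired step size is a model input
    t = 0.0
    for _ in range(n_steps):
        x = x + dt * net(x, t, dt)         # jump dt ahead in one evaluation
        t += dt
    return x

# toy stand-in net; a trained model would replace this lambda
net = lambda x, t, dt: -x
one_step = shortcut_sample(net, (4, 8), n_steps=1)     # one-step generation
refined = shortcut_sample(net, (4, 8), n_steps=128)    # larger budget
```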

16. Unveiling Concept Attribution in Diffusion Models

[openreview] [pdf]

Abstract Diffusion models have shown remarkable abilities in generating realistic and high-quality images from text prompts. However, a trained model remains a black box; little do we know about the role of its components in exhibiting a concept such as an object or a style. Recent works employ causal tracing to localize layers storing knowledge in generative models. In this work, we approach the problem from a more general perspective and pose the question: "How do model components work jointly to demonstrate knowledge?". We adapt component attribution to decompose diffusion models, unveiling how each component contributes to a concept. Our framework allows effective model editing; in particular, we can erase a concept from diffusion models by removing positive components while retaining knowledge of other concepts. Surprisingly, we also show that there exist components that contribute negatively to a concept, which have not been discovered by knowledge localization approaches. Experimental results confirm the roles of the positive and negative components pinpointed by our framework, depicting a complete view of interpreting generative models.

[openreview] [pdf]

Abstract Time-series forecasting finds broad applications in real-world scenarios. Due to the dynamic nature of time series data, it is crucial for time-series forecasting models to produce robust predictions under potential distribution shifts. In this paper, we first identify two types of distribution shift in time series: concept drift and temporal shift. We acknowledge that while existing studies primarily focus on addressing temporal shift in time series, designing proper concept drift methods for time series data has received comparatively less attention. Motivated by the need to mitigate potential concept drift in time-series forecasting, this work proposes a novel soft attention mechanism that effectively leverages and ensembles information from the horizon time series. Furthermore, recognizing that concept drift and temporal shift can occur concurrently in time-series forecasting scenarios while an integrated solution remains missing, this paper introduces ShifTS, a model-agnostic framework that seamlessly addresses both concept drift and temporal shift in time-series forecasting. Extensive experiments demonstrate the efficacy of ShifTS in consistently enhancing the forecasting accuracy of agnostic models across multiple datasets, and in consistently outperforming existing concept drift, temporal shift, and combined baselines.

18. Can Diffusion Models Disentangle? A Theoretical Perspective

[openreview] [pdf]

Abstract This paper introduces a novel theoretical framework to understand how diffusion models can learn disentangled representations under the assumption of an $\ell_2$-accurate score approximation. We also provide sufficient conditions under which such representations are beneficial for domain adaptation. Our theory offers new insights into how existing diffusion models disentangle latent variables across general distributions and suggests strategies to enhance their disentanglement capabilities. To validate our theory, we perform experiments using both synthetic data generated from latent subspace models and real speech data for non-parallel voice conversion - a canonical disentanglement problem. Across various classification tasks, we find that voice-conversion-based adaptation methods achieve significant improvements in classification accuracy, demonstrating their effectiveness as domain adaptors. Code will be released upon acceptance.

19. Fast Multi-Mode Adaptive Generative Distillation for Continually Learning Diffusion Models

[openreview] [pdf]

Abstract Diffusion models are powerful generative models, but their computational demands, vulnerability to catastrophic forgetting, and class imbalance in generated data pose significant challenges in continual learning scenarios. In this paper, we introduce Fast Multi-Mode Adaptive Generative Distillation (MAGD), a novel approach designed to address these three core challenges. MAGD combines generative replay and knowledge distillation, enhancing the continual training of diffusion models through three key innovations: (1) Noisy Intermediate Generative Distillation (NIGD), which leverages intermediate noisy images during the reverse diffusion process to improve data utility and preserve image quality without additional computational costs; (2) Class-guided generative distillation (CGGD), which uses classifier guidance to ensure balanced class representation in generated images, addressing the issue of class imbalance in traditional methods; and (3) Signal-Guided Generative Distillation (SGGD), which reduces computational overhead while maintaining image clarity through the reuse of the model’s denoising capabilities across tasks. Our experimental results on Fashion-MNIST, CIFAR-10, and CIFAR-100 demonstrate that MAGD significantly outperforms existing methods in both image quality, measured by Fréchet Inception Distance (FID), and class balance, measured by Kullback-Leibler Divergence (KLD). Moreover, MAGD achieves competitive results with far fewer generation steps compared to traditional methods, making it a practical solution for real-life continual learning applications.

20. A Tailored Framework for Aligning Diffusion Models with Human Preference

[openreview] [pdf]

Abstract The direct preference optimization (DPO) method has shown success in aligning text-to-image diffusion models with human preference. Previous approaches typically assume a consistent preference label between final generated images and their corresponding noisy samples at intermediate steps, and directly apply DPO to these noisy samples for fine-tuning. However, we identify a significant issue with this consistency assumption, as directly applying DPO to noisy samples from different generation trajectories based on final preference order may disrupt the optimization process. We first demonstrate the issues inherent in previous methods from two perspectives: gradient direction and preference order, and then propose a Tailored Preference Optimization (TailorPO) framework for aligning diffusion models with human preference, underpinned by some theoretical insights. Our approach directly ranks the preference order of intermediate noisy samples based on their step-wise reward, and effectively resolves the optimization direction issues through a simple yet efficient design. Additionally, to the best of our knowledge, we are the first to consider the distinct structure of diffusion models and leverage the gradient guidance in preference aligning to enhance the optimization effectiveness. Experimental results demonstrate that our method significantly improves the model’s ability to generate aesthetically pleasing and human-preferred images.

21. Optimal Targets for Concept Erasure in Diffusion Models and Where To Find Them

[openreview] [pdf]

Abstract Concept erasure has emerged as a promising technique for mitigating the risk of harmful content generation in diffusion models by selectively unlearning undesirable concepts. The common principle of previous works to remove a specific concept is to map it to a fixed generic concept, such as a neutral concept or just an empty text prompt. In this paper, we demonstrate that this fixed-target strategy is suboptimal, as it fails to account for the impact of erasing one concept on the others. To address this limitation, we model the concept space as a graph and empirically analyze the effects of erasing one concept on the remaining concepts. Our analysis uncovers intriguing geometric properties of the concept space, where the influence of erasing a concept is confined to a local region. Building on this insight, we propose the Adaptive Guided Erasure (AGE) method, which \emph{dynamically} selects neutral concepts tailored to each undesirable concept, minimizing unintended side effects. Experimental results show that AGE significantly outperforms state-of-the-art erasure methods on preserving unrelated concepts while maintaining effective erasure performance.

22. Protecting Minorities in Diffusion Models via Capacity Allocation

[openreview] [pdf]

Abstract Diffusion models have advanced quickly in image generation. However, their performance declines significantly on the imbalanced data commonly encountered in real-world scenarios. Current research on imbalanced diffusion models focuses on improving the objective function to facilitate knowledge transfer between majorities and minorities, thereby enhancing the generation of minority samples. In this paper, we make the first attempt to address the imbalanced data challenges in diffusion models from the perspective of model capacity. Specifically, majorities occupy most of the model capacity because of their larger representation, consequently restricting the capacity available for minority classes. To tackle this challenge, we propose Protecting Minorities via Capacity ALLocation (CALL). We reserve capacity for minority expertise by low-rank decomposing the model parameters and allocate the corresponding knowledge to the reserved model capacity through a capacity allocation loss function. Extensive experiments demonstrate that our method, which is orthogonal to existing methods, consistently and significantly improves the robustness of diffusion models on imbalanced data.
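
Schematically, the capacity reservation can be pictured as a base weight plus a reserved low-rank term routed toward minority samples. The sketch below is our illustrative reading; the paper's capacity allocation loss and the exact gating mechanism are assumed away.

```python
# Illustrative capacity-reserved layer: shared base weights plus a low-rank
# pathway reserved for minority classes. Gating and loss are assumptions.
import torch
import torch.nn as nn

class CapacityAllocLinear(nn.Module):
    def __init__(self, d_in, d_out, rank=4):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)               # shared capacity
        self.U = nn.Parameter(torch.randn(d_out, rank) * 0.01)
        self.V = nn.Parameter(torch.randn(rank, d_in) * 0.01)

    def forward(self, x, minority_gate):
        # minority_gate in [0, 1]: routes minority samples through the
        # reserved low-rank pathway in addition to the base weights
        return self.base(x) + minority_gate * (x @ self.V.T @ self.U.T)

layer = CapacityAllocLinear(32, 32)
y_minor = layer(torch.randn(4, 32), minority_gate=1.0)
y_major = layer(torch.randn(4, 32), minority_gate=0.0)
```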

23. DC-DPM: A Divide-and-Conquer Approach for Diffusion Reverse Process

[openreview] [pdf]

Abstract Diffusion models have achieved great success in generative tasks. However, previous approaches typically approximate the reversed transition kernel with a Gaussian distribution. This approximation can diverge from real scenarios, necessitating multiple iterative steps for high-quality sample generation and limiting the real-time inference performance of diffusion models. In this paper, we propose a Divide-and-Conquer strategy to improve the traditional single-Gaussian transition kernel representation in each denoising step of Diffusion Probabilistic Models (DC-DPM), thus enhancing generation quality particularly over a limited number of timesteps. By dividing the data into clusters, our DC-DPM learns specific kernels for each partition. We design two merging strategies for these cluster-specific kernels along with corresponding training and sampling methods. We provide theoretical proof of DC-DPM’s convergence to the true data distribution from a novel perspective. Experimental results demonstrate the superior generation quality of our method compared to the traditional single Gaussian kernel. Furthermore, our DC-DPM can synergize with previous kernel optimization methods, enhancing their generation quality, especially with a small number of timesteps.

24. Adaptive Concept Bottleneck for Foundation Models Under Distribution Shifts

[openreview] [pdf]

Abstract Advancements in foundation models (FMs) have led to a paradigm shift in machine learning. The rich, expressive feature representations from these pre-trained, large-scale FMs are leveraged for multiple downstream tasks, usually via lightweight fine-tuning of a shallow fully-connected network following the representation. However, the non-interpretable, black-box nature of this prediction pipeline can be a challenge, especially in critical domains, such as healthcare, finance, and security. In this paper, we explore the potential of Concept Bottleneck Models (CBMs) for transforming complex, non-interpretable foundation models into interpretable decision-making pipelines using high-level concept vectors. Specifically, we focus on the test-time deployment of such an interpretable CBM pipeline “in the wild”, where the distribution of inputs often shifts from the original training distribution. We first identify the potential failure modes of such pipelines under different types of distribution shifts. Then we propose an adaptive concept bottleneck framework to address these failure modes, that dynamically adapts the concept-vector bank and the prediction layer based solely on unlabeled data from the target domain, without access to the source dataset. Empirical evaluations with various real-world distribution shifts show our framework produces concept-based interpretations better aligned with the test data and boosts post-deployment accuracy by up to 28%, aligning CBM performance with that of non-interpretable classification.

25. Latent Weight Diffusion: Generating policies from trajectories

[openreview] [pdf]

Abstract With the increasing availability of open-source robotic data, imitation learning has emerged as a viable approach for both robot manipulation and locomotion. Currently, large generalized policies are trained to predict controls or trajectories using diffusion models, which have the desirable property of learning multimodal action distributions. However, generalizability comes with a cost — namely, larger model size and slower inference. Further, there is a known trade-off between performance and action horizon for Diffusion Policy (i.e., diffusing trajectories): fewer diffusion queries accumulate greater trajectory tracking errors. Thus, it is common practice to run these models at high inference frequency, subject to robot computational constraints. To address these limitations, we propose Latent Weight Diffusion (LWD), a method that uses diffusion to learn a distribution over policies for robotic tasks, rather than over trajectories. Our approach encodes demonstration trajectories into a latent space and then decodes them into policies using a hypernetwork. We employ a diffusion denoising model within this latent space to learn its distribution. We demonstrate that LWD can reconstruct the behaviors of the original policies that generated the trajectory dataset. LWD offers the benefits of considerably smaller policy networks during inference and requires fewer diffusion model queries. When tested on the Metaworld MT10 benchmark, LWD achieves a higher success rate compared to a vanilla multi-task policy, while using models up to ∼18x smaller during inference. Additionally, since LWD generates closed-loop policies, we show that it outperforms Diffusion Policy in long action horizon settings, with reduced diffusion queries during rollout.
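
The hypernetwork decode is the distinctive step. Below is a minimal sketch under an assumed interface: a latent z (which in LWD would come from the latent diffusion model) is mapped to the flattened weights of a small policy MLP.

```python
# Schematic hypernetwork decode: latent vector -> weights of a policy MLP.
# Dimensions and architecture are illustrative assumptions.
import torch
import torch.nn as nn

obs_dim, act_dim, hid = 8, 2, 32
n_params = obs_dim * hid + hid + hid * act_dim + act_dim

hypernet = nn.Sequential(nn.Linear(16, 256), nn.ReLU(), nn.Linear(256, n_params))

def decode_policy(z):
    """Split the hypernetwork output into the policy MLP's weight matrices."""
    w = hypernet(z)
    i = 0
    W1 = w[i:i + obs_dim * hid].view(hid, obs_dim); i += obs_dim * hid
    b1 = w[i:i + hid]; i += hid
    W2 = w[i:i + hid * act_dim].view(act_dim, hid); i += hid * act_dim
    b2 = w[i:i + act_dim]
    return lambda obs: torch.tanh(obs @ W1.T + b1) @ W2.T + b2

policy = decode_policy(torch.randn(16))   # z would be sampled by the diffusion
action = policy(torch.randn(obs_dim))     # closed-loop: cheap per-step inference
```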

26. Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning

[openreview] [pdf]

Abstract Controllable generation through Stable Diffusion (SD) fine-tuning aims to improve fidelity, safety, and alignment with human guidance. Existing reinforcement learning from human feedback methods usually rely on predefined heuristic reward functions or pretrained reward models built on large-scale datasets, limiting their applicability to scenarios where collecting such data is costly or difficult. To effectively and efficiently utilize human feedback, we develop a framework, HERO, which leverages online human feedback collected on the fly during model learning. Specifically, HERO features two key mechanisms: (1) Feedback-Aligned Representation Learning, an online training method that captures human feedback and provides informative learning signals for fine-tuning, and (2) Feedback-Guided Image Generation, which involves generating images from SD’s refined initialization samples, enabling faster convergence towards the evaluator’s intent. We demonstrate that HERO is 4x more efficient in online feedback for body part anomaly correction compared to the best existing method. Additionally, experiments show that HERO can effectively handle tasks like reasoning, counting, personalization, and reducing NSFW content with only 0.5K online feedback.

27. How do diffusion models learn and generalize on abstract rules for reasoning?

[openreview] [pdf]

Abstract Diffusion models excel in generating and completing patterns in images. But how good is their ability to learn hidden rules from samples and to generate and reason according to such rules, or even generalize to similar rules? We trained a wide family of unconditional diffusion models on Raven's progressive matrices to study this precisely. We quantified their capability to generate structurally consistent samples and to complete missing parts according to hidden rules. We found diffusion models can synthesize novel samples consistent with rules without memorizing the training set, much better than GPT2 trained on the same data. They memorized and recombined local parts of the training samples to create new rule-conforming samples. When tasked to complete the missing panel with inpainting techniques, advanced sampling techniques were needed to perform well. Moreover, their pattern completion capability can generalize to rules unseen during training. Finally, through generative training on rule data, a robust rule representation rapidly emerged in the diffusion model, which could linearly classify rules at 99.8% test accuracy. Our results suggest diffusion training is a useful paradigm for reasoning and for learning representations for downstream tasks, even for abstract rule data.

28. O(d/T) Convergence Theory for Diffusion Probabilistic Models under Minimal Assumptions

[openreview] [pdf]

Abstract Score-based diffusion models, which generate new data by learning to reverse a diffusion process that perturbs data from the target distribution into noise, have achieved remarkable success across various generative tasks. Despite their superior empirical performance, existing theoretical guarantees are often constrained by stringent assumptions or suboptimal convergence rates. In this paper, we establish a fast convergence theory for a popular SDE-based sampler under minimal assumptions. Our analysis shows that, provided $\ell_2$-accurate estimates of the score functions, the total variation distance between the target and generated distributions is upper bounded by $O(d/T)$ (ignoring logarithmic factors), where $d$ is the data dimensionality and $T$ is the number of steps. This result holds for any target distribution with a finite first-order moment. To our knowledge, this improves upon existing convergence theory for both the SDE-based sampler and another ODE-based sampler, while imposing minimal assumptions on the target data distribution and score estimates. This is achieved through a novel set of analytical tools that provides a fine-grained characterization of how the error propagates at each step of the reverse process.
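
Schematically (our paraphrase, with constants and logarithmic factors suppressed), the guarantee reads:

```latex
% Schematic paraphrase of the stated result; constants and log factors
% suppressed, $\ell_2$-accurate score estimates assumed.
\mathsf{TV}\left(q_{\mathrm{data}},\, q_{\mathrm{sampler}}\right)
  \;\lesssim\; \frac{d}{T},
\qquad \text{for any } q_{\mathrm{data}} \text{ with finite first moment.}
```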

29. Diffusion Models are Evolutionary Algorithms

[openreview] [pdf]

Abstract In a convergence of machine learning and biology, we reveal that diffusion models are evolutionary algorithms. By considering evolution as a denoising process and reversed evolution as diffusion, we mathematically demonstrate that diffusion models inherently perform evolutionary algorithms, naturally encompassing selection, mutation, and reproductive isolation. Building on this equivalence, we propose the Diffusion Evolution method: an evolutionary algorithm utilizing iterative denoising -- as originally introduced in the context of diffusion models -- to heuristically refine solutions in parameter spaces. Unlike traditional approaches, Diffusion Evolution efficiently identifies multiple optimal solutions and outperforms prominent mainstream evolutionary algorithms. Furthermore, leveraging advanced concepts from diffusion models, namely latent space diffusion and accelerated sampling, we introduce Latent Space Diffusion Evolution, which finds solutions for evolutionary tasks in high-dimensional complex parameter space while significantly reducing computational steps. This parallel between diffusion and evolution not only bridges two different fields but also opens new avenues for mutual enhancement, raising questions about open-ended evolution and potentially utilizing non-Gaussian or discrete diffusion models in the context of Diffusion Evolution.
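
The evolution-as-denoising correspondence can be played out on a toy objective. The following is an illustrative analogue, not the paper's exact update rule: each individual moves toward a fitness- and proximity-weighted average of the population (a denoising-style step) while injected noise anneals away.

```python
# Toy denoising-as-evolution loop on a two-optima objective; the weighting
# scheme and schedules are our assumptions for illustration.
import numpy as np

def fitness(x):                       # toy objective with two optima at +/-2
    return np.exp(-np.sum((x - 2) ** 2, -1)) + np.exp(-np.sum((x + 2) ** 2, -1))

rng = np.random.default_rng(0)
pop = rng.normal(0, 4, size=(256, 2))          # population starts as "noise"
for alpha in np.linspace(0.05, 1.0, 60):
    f = fitness(pop)
    # each individual estimates its denoised target from the population,
    # weighted by fitness and proximity (kernel width shrinks over time)
    d2 = ((pop[:, None, :] - pop[None, :, :]) ** 2).sum(-1)
    w = f[None, :] * np.exp(-d2 / (2 * (4 * (1 - alpha) + 0.1) ** 2))
    target = (w[:, :, None] * pop[None, :, :]).sum(1) / w.sum(1, keepdims=True)
    noise = rng.normal(0, np.sqrt(max(1 - alpha, 0.0)), pop.shape)
    pop = alpha * target + (1 - alpha) * pop + 0.5 * noise

print(np.round(pop[:5], 2))   # individuals concentrate near both optima
```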

30. How and how well do diffusion models improve adversarial robustness?

[openreview] [pdf]

Abstract Recent findings suggest that diffusion models significantly enhance empirical adversarial robustness. While some intuitive explanations have been proposed, the precise mechanisms underlying these improvements remain unclear. In this work, we systematically investigate how and how well diffusion models improve adversarial robustness. First, we observe that diffusion models intriguingly increase—rather than decrease—the $\ell_p$ distances to clean samples; this is the opposite of what was believed previously. Second, we find that the purified images are heavily influenced by the internal randomness of diffusion models. To properly evaluate the robustness of systems with inherent randomness, we introduce the concept of fuzzy adversarial robustness, and find that empirically a substantial fraction of adversarial examples are fuzzy in nature. Finally, by leveraging a hyperspherical cap model of adversarial regions, we show that diffusion models increase robustness by dramatically compressing the image space. Our findings provide novel insights into the mechanisms behind the robustness improvements of diffusion-model-based purification and offer guidance for the development of more efficient adversarial purification systems.

31. Distributionally Robust Policy Learning under Concept Drifts

[openreview] [pdf]

Abstract Distributionally robust policy learning aims to find a policy that performs well under the worst-case distributional shift, yet most existing methods for robust policy learning consider the worst-case joint distribution of the covariate and the outcome. This joint-modeling strategy can be unnecessarily conservative when we have more information on the source of distributional shifts. This paper studies a more nuanced problem --- robust policy learning under concept drift, when only the conditional relationship between the outcome and the covariate changes. To this end, we first provide a doubly-robust estimator for evaluating the worst-case average reward of a given policy under a set of perturbed conditional distributions. We show that the policy value estimator enjoys asymptotic normality even if the nuisance parameters are estimated at a slower-than-root-$n$ rate. We then propose a learning algorithm that outputs the policy maximizing the estimated policy value within a given policy class $\Pi$, and show that the sub-optimality gap of the proposed algorithm is of the order $\kappa(\Pi) n^{-1/2}$, where $\kappa(\Pi)$ is the entropy integral of $\Pi$ under the Hamming distance and $n$ is the sample size. The proposed methods are implemented and evaluated in numerical studies, demonstrating substantial improvement compared with existing benchmarks.

32. Representative Guidance: Diffusion Model Sampling with Consistency

[openreview] [pdf]

Abstract The diffusion sampling process faces a persistent challenge stemming from its incoherence, attributable to varying noise directions across different time steps. Our Representative Guidance (RepG) offers a new perspective to handle this issue by reformulating the sampling process with a coherent direction towards a representative target. In this formulation, while the classic classifier guidance improves feature discernment by steering the model away from ambiguous features, it fails to provide a favorable representative target, since the class label is overly compact and leads to sacrificed diversity and the adversarial generation problem. In contrast, we leverage self-supervised representations as the coherent target and treat sampling as a downstream task, which refines image details and corrects errors rather than settling for simpler samples. Our representative guidance achieves superior performance and also illustrates the potential of pre-trained self-supervised models in image sampling. Our findings demonstrate that RepG not only substantially enhances vanilla diffusion sampling but also surpasses state-of-the-art benchmarks when combined with the classifier-free guidance. Our code will be released.

33. Diffusion Transportation Cost for Domain Adaptation

[openreview] [pdf]

Abstract In recent years, there has been considerable interest in leveraging the Optimal Transport (OT) problem for domain adaptation, a strategy shown to be highly effective. However, a less explored aspect is the choice of the transportation cost function, as most existing methods rely on the pairwise squared Euclidean distances for the transportation cost, potentially overlooking important intra-domain geometries. This paper presents Diffusion-OT, a new transport cost for the OT problem, designed specifically for domain adaptation. By utilizing concepts and tools from the field of manifold learning, specifically diffusion geometry, we derive an operator that accounts for the intra-domain relationships, thereby extending beyond the conventional inter-domain distances. This operator, which quantifies the probability of transporting between source and target samples, forms the basis for our transportation cost. We provide proof that the proposed operator is in fact a diffusion operator, demonstrating that the cost function is defined by an anisotropic diffusion process between the domains. In addition, to enhance performance, we integrate source labels into the operator, thereby guiding the anisotropic diffusion according to the classes. We showcase the effectiveness of Diffusion-OT through comprehensive experiments, demonstrating its superior performance compared to recent methods across various benchmarks and datasets.

34. Improved Convergence Rate for Diffusion Probabilistic Models

[openreview] [pdf]

Abstract Score-based diffusion models have achieved remarkable empirical performance in the field of machine learning and artificial intelligence for their ability to generate high-quality new data instances from complex distributions. Improving our understanding of diffusion models, notably via convergence analysis, has attracted a lot of interest. Despite many theoretical attempts, a significant gap remains between theory and practice. To close this gap, we establish an iteration complexity of order $d^{1/3}\varepsilon^{-2/3}$, which improves on $d^{5/12}\varepsilon^{-1}$, the best known complexity achieved before our work. This convergence analysis is based on a randomized midpoint method, which was first proposed for log-concave sampling \citep{Shen2019TheRandomized} and then extended to diffusion models by \citet{Gupta2024Faster}. Our theory accommodates $\varepsilon$-accurate score estimates and does not require log-concavity of the target distribution. Moreover, the algorithm can also be parallelized to run in only $O(\log^2(d/\varepsilon))$ parallel rounds, in a similar way to prior works.
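
For readers unfamiliar with the randomized midpoint method the analysis builds on, here is a minimal sketch for a generic ODE x' = b(x, t); its adaptation to diffusion samplers (following the cited works) is omitted here.

```python
# Randomized midpoint step for x' = b(x, t): pick a random point inside the
# step, take a cheap predictor there, then use that slope for the full step.
import numpy as np

def randomized_midpoint_step(x, t, h, b, rng):
    u = rng.random()                    # random location within the step
    x_mid = x + u * h * b(x, t)         # cheap predictor to the midpoint
    return x + h * b(x_mid, t + u * h)  # full step using the midpoint slope

rng = np.random.default_rng(0)
x, t, h = 1.0, 0.0, 0.01
for _ in range(100):                    # integrate x' = -x over [0, 1]
    x = randomized_midpoint_step(x, t, h, lambda x, t: -x, rng)
    t += h
print(x, np.exp(-1))                    # close to the exact solution exp(-1)
```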

35. Exploration by Running Away from the Past

[openreview] [pdf]

Abstract The ability to explore efficiently and effectively is a central challenge of reinforcement learning. In this work, we consider exploration through the lens of information theory. Specifically, we cast exploration as a problem of maximizing the Shannon entropy of the state occupation measure. This is done by maximizing a sequence of divergences between distributions representing an agent’s past behavior and its current behavior. Intuitively, this encourages the agent to explore new behaviors that are distinct from past behaviors. Hence, we call our method RAMP, for “Running Away from the Past.” A fundamental question for this method is the quantification of the distribution change over time. We consider both the Kullback-Leibler divergence and the Wasserstein distance to quantify the divergence between successive state occupation measures, and explain why the former might lead to undesirable exploratory behaviors in some tasks. We demonstrate that by encouraging the agent to explore by actively distancing itself from past experiences, it can effectively explore mazes and a wide range of behaviors on robotic manipulation and locomotion tasks.

36. Broadening Target Distributions for Accelerated Diffusion Models via a Novel Analysis Approach

[openreview] [pdf]

Abstract Accelerated diffusion models hold the potential to significantly enhance the efficiency of standard diffusion processes. Theoretically, these models have been shown to achieve faster convergence rates than the standard $\mathcal{O}(1/\epsilon^2)$ rate of vanilla diffusion models, where $\epsilon$ denotes the target accuracy. However, current theoretical studies have established the acceleration advantage only for restrictive target distribution classes, such as those with smoothness conditions imposed along the entire sampling path or with bounded support. In this work, we significantly broaden the target distribution classes with a new accelerated stochastic DDPM sampler. In particular, we show that it achieves accelerated performance for three broad distribution classes not considered before. Our first class relies on a smoothness condition posed only on the target density $q_0$, which is far more relaxed than the existing smoothness conditions posed on all $q_t$ along the entire sampling path. Our second class requires only a finite second moment condition, allowing for a much wider class of target distributions than the existing finite-support condition. Our third class is Gaussian mixtures, for which our result establishes the first acceleration guarantee. Moreover, among accelerated DDPM-type samplers, our results specialized to bounded-support distributions show an improved dependency on the data dimension $d$. Our analysis introduces a novel technique for establishing performance guarantees via constructing a tilting factor representation of the convergence error and utilizing Tweedie’s formula to handle Taylor expansion terms. This new analytical framework may be of independent interest.

37. Unstable Unlearning: The Hidden Risk of Concept Resurgence in Diffusion Models

[openreview] [pdf]

Abstract Text-to-image diffusion models rely on massive, web-scale datasets. Training them from scratch is computationally expensive, and as a result, developers often prefer to make incremental updates to existing models. These updates often compose fine-tuning steps (to learn new concepts or improve model performance) with “unlearning” steps (to “forget” existing concepts, such as copyrighted data or the ability to generate explicit content). In this work, we demonstrate a critical and previously unknown vulnerability that arises in this paradigm: even under benign, non-adversarial conditions, fine-tuning a text-to-image diffusion model on seemingly unrelated images can cause it to “relearn” concepts that were previously “unlearned.” We comprehensively investigate the causes and scope of this phenomenon, which we term concept resurgence, by performing a series of experiments based on fine-tuning Stable Diffusion v1.4 alongside “mass concept erasure”, the current state of the art for unlearning in text-to-image diffusion models (Lu et al., 2024). Our findings underscore the fragility of composing incremental model updates, and raise new serious concerns about current approaches to ensuring the safety and alignment of text-to-image diffusion models.

38. How to Find the Exact Pareto Front for Multi-Objective MDPs?

[openreview] [pdf]

Abstract Multi-objective Markov Decision Processes (MDPs) are receiving increasing attention, as real-world decision-making problems often involve conflicting objectives that cannot be addressed by a single-objective MDP. The Pareto front identifies the set of policies that cannot be dominated, providing a foundation for finding Pareto optimal solutions that can efficiently adapt to various preferences. However, finding the Pareto front is a highly challenging problem. Most existing methods either (i) rely on traversing the continuous preference space, which is impractical and results in approximations that are difficult to evaluate against the true Pareto front, or (ii) focus solely on deterministic Pareto optimal policies, from which there are no known techniques to characterize the full Pareto front. Moreover, the structure of the Pareto front itself remains unclear even in the context of dynamic programming, where the MDP is fully known in advance. In this work, we address the challenge of efficiently discovering the Pareto front. By investigating the geometric structure of the Pareto front in MO-MDPs, we uncover a key property: the Pareto front is on the boundary of a convex polytope whose vertices all correspond to deterministic policies, and neighboring vertices of the Pareto front differ by only one state-action pair of the deterministic policy, almost surely. This insight transforms the global comparison across all policies into a localized search among deterministic policies that differ by only one state-action pair, drastically reducing the complexity of searching for the exact Pareto front. We develop an efficient algorithm that identifies the vertices of the Pareto front by solving a single-objective MDP only once and then traversing the edges of the Pareto front, making it more efficient than existing methods. Furthermore, the entire Pareto front can be found in $V$ iterations, where $V$ is the number of vertices on the Pareto front. Our empirical studies demonstrate the effectiveness of our theoretical strategy in discovering the Pareto front efficiently.

39. APCtrl: Adding Conditional Control to Diffusion Models by Alternative Projection

[openreview] [pdf]

Abstract Enhancing the versatility of pretrained diffusion models through advanced conditioning techniques is crucial for improving their applicability. We present APCtrl, a novel conditional image generation approach that formulates the latent $z_t$ at timestep $t$ as the projection $z_t = \text{Proj}_{\mathfrak{D}_t}(z_{t+1})$ onto the denoising set $\mathfrak{D}_t$. For conditional control, APCtrl integrates the condition set $\mathfrak{C}_t$, defined by a latent control network $\mathcal{A}_{\theta}(\cdot, \cdot)$. Our method simplifies conditional sampling to the recursive projection $z_t = \text{Proj}_{\mathfrak{I}_t} \circ \text{Proj}_{\mathfrak{D}_t}(z_{t+1})$, where each projection step integrates both the diffusion and condition priors. By employing Alternative Projection, our approach offers several key advantages: 1. Multi-Condition Generation: easily expandable with additional conditional sets; 2. Model and Sampling Agnosticism: works with any model or sampling method; 3. Unified Control Loss: simplifies the management of diverse control applications; 4. Efficiency: delivers comparable control with reduced training and sampling times. Extensive experiments demonstrate the superior performance of our method.

40Choose Your Anchor Wisely: Effective Unlearning Diffusion Models via Concept Reconditioning

[openreview] [pdf]

Abstract Large-scale conditional diffusion models (DMs) have demonstrated exceptional ability in generating high-quality images from textual descriptions, gaining widespread use across various domains. However, these models also carry the risk of producing harmful, sensitive, or copyrighted content, creating a pressing need to remove such information from their generation capabilities. While retraining from scratch is prohibitively expensive, machine unlearning provides a more efficient solution by selectively removing undesirable knowledge while preserving utility. In this paper, we introduce COncept REconditioning (CORE), a simple yet effective approach for unlearning diffusion models. Similar to some existing approaches, CORE guides the noise predictor conditioned on forget concepts towards an anchor generated from alternative concepts. However, CORE introduces key differences in the choice of anchor and retain loss, which contribute to its enhanced performance. We evaluate the unlearning effectiveness and retainability of CORE on UnlearnCanvas. Extensive experiments demonstrate that CORE surpasses state-of-the-art methods, including its close variants, and achieves near-perfect performance, especially when we aim to forget multiple concepts. Further ablation studies show that CORE’s careful selection of the anchor and retain loss is critical to its superior performance.

41Enhancing Dataset Distillation with Concurrent Learning: Addressing Negative Correlations and Catastrophic Forgetting in Trajectory Matching

[openreview] [pdf]

Abstract Dataset distillation generates a small synthetic dataset on which a model is trained to achieve performance comparable to that obtained on a complete dataset. Current state-of-the-art methods primarily focus on Trajectory Matching (TM), which optimizes the synthetic dataset by matching its training trajectory with that from the real dataset. Due to convergence issues and numerical instability, it is impractical to match the entire trajectory in one go; typically, a segment is sampled for matching at each iteration. However, previous TM-based methods overlook the potential interactions between matching different segments, particularly the presence of negative correlations. To study this problem, we conduct a quantitative analysis of the correlation between matching different segments and discover varying degrees of negative correlation depending on the images per class (IPC). Such negative correlation can increase the accumulated trajectory error and turn trajectory matching into a continual learning paradigm, potentially causing catastrophic forgetting. To tackle this issue, we propose a concurrent learning-based trajectory matching that simultaneously matches multiple segments. Extensive experiments demonstrate that our method consistently surpasses previous TM-based methods on CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet-1K.
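A minimal sketch of the concurrent idea, assuming a hypothetical match_segment(seg) that returns the trajectory-matching loss for one sampled segment: rather than optimizing one segment per iteration, several segments are matched in a single update, so negatively correlated segments are reconciled by one gradient step.

```python
def concurrent_matching_loss(match_segment, segments):
    # Sum the matching losses of several trajectory segments; optimizing
    # the summed loss avoids the sequential, continual-learning-style
    # updates that can cause catastrophic forgetting across segments.
    return sum(match_segment(seg) for seg in segments)
```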

42Discrete Distribution Networks

[openreview] [pdf]

Abstract We introduce a novel generative model, the Discrete Distribution Networks (DDN), that approximates the data distribution using hierarchical discrete distributions. We posit that since the features within a network inherently capture distributional information, enabling the network to generate multiple samples simultaneously, rather than a single output, may offer an effective way to represent distributions. Therefore, DDN fits the target distribution, including continuous ones, by generating multiple discrete sample points. To capture finer details of the target data, DDN selects the output that is closest to the Ground Truth (GT) from the coarse results generated in the first layer. This selected output is then fed back into the network as a condition for the second layer, thereby generating new outputs more similar to the GT. As the number of DDN layers increases, the representational space of the outputs expands exponentially, and the generated samples become increasingly similar to the GT. This hierarchical output pattern of discrete distributions endows DDN with a unique property: more general zero-shot conditional generation. We demonstrate the efficacy of DDN and its intriguing properties through experiments on CIFAR-10 and FFHQ.
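The layer-wise selection step is easy to picture in code. A minimal sketch, assuming each layer emits a tensor of K candidate samples and selection is by L2 distance to the ground truth (distance metric and shapes are illustrative assumptions):

```python
import torch

def select_closest(candidates, gt):
    # candidates: (K, ...) samples emitted by one DDN layer; gt: (...) target.
    # Pick the candidate nearest to the ground truth; it is fed back as the
    # condition for the next, finer layer of the hierarchy.
    dists = ((candidates - gt.unsqueeze(0)) ** 2).flatten(1).sum(dim=1)
    return candidates[dists.argmin()]
```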

43Anti-Exposure Bias in Diffusion Models via Prompt Learning

[openreview] [pdf]

Abstract Diffusion models (DMs) have achieved record-breaking performance in image generation tasks. Nevertheless, in practice, the training-sampling discrepancy, caused by score estimation error and discretization error, limits the modeling ability of DMs, a phenomenon known as exposure bias. To alleviate such exposure bias and further improve the generative performance, we put forward a prompt learning framework built upon a lightweight prompt prediction model. Concretely, our model learns an anti-bias prompt for the generated sample at each sampling step, aiming to compensate for the exposure bias that arises. Following this design philosophy, our framework rectifies the sampling trajectory to match the training trajectory, thereby reducing the divergence between the target data distribution and the modeling distribution. To train the prompt prediction model, we simulate exposure bias by constructing training data and introduce a time-dependent weighting function for optimization. Empirical results on various DMs demonstrate the superiority of our prompt learning framework across three benchmark datasets. Importantly, the optimized prompt prediction model effectively improves image quality with only a 5% increase in sampling overhead, which remains negligible.

44Direct Distributional Optimization for Provable Alignment of Diffusion Models

[openreview] [pdf]

Abstract We introduce a novel alignment method for diffusion models from distribution optimization perspectives while providing rigorous convergence guarantees. We first formulate the problem as a generic regularized loss minimization over probability distributions and directly optimize the distribution using the Dual Averaging method. Next, we enable sampling from the learned distribution by approximating its score function via Doob’s h-transform technique. The proposed framework is supported by rigorous convergence guarantees and an end-to-end bound on the sampling error, which imply that when the original distribution’s score is known accurately, the complexity of sampling from shifted distributions is independent of isoperimetric conditions. This framework is broadly applicable to general distribution optimization problems, including alignment tasks in Reinforcement Learning with Human Feedback (RLHF), Direct Preference Optimization (DPO), and Kahneman-Tversky Optimization (KTO). We empirically validate its performance on synthetic and image datasets using the DPO objective.

45Stabilizing the Kumaraswamy Distribution

[openreview] [pdf]

Abstract Large-scale latent variable models require expressive continuous distributions that support efficient sampling and low-variance differentiation, achievable through the reparameterization trick. The Kumaraswamy (KS) distribution is both expressive and supports the reparameterization trick with a simple closed-form inverse CDF. Yet, its adoption remains limited. We identify and resolve numerical instabilities in the inverse CDF and log-pdf, exposing issues in libraries like PyTorch and TensorFlow. We then introduce simple and scalable latent variable models based on the KS, improving exploration-exploitation trade-offs in contextual multi-armed bandits and enhancing uncertainty quantification for link prediction with graph neural networks. Our results support the stabilized KS distribution as a core component in scalable variational models for bounded latent variables.
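To make the instability concrete: the KS inverse CDF is $F^{-1}(u) = (1-(1-u)^{1/b})^{1/a}$, and the subtraction $1-(1-u)^{1/b}$ cancels catastrophically when $(1-u)^{1/b}$ is close to 1. A minimal sketch of the kind of log-space stabilization the abstract alludes to (the paper's exact fix may differ):

```python
import torch

def kumaraswamy_icdf_stable(u, a, b):
    # F^{-1}(u) = (1 - (1 - u)^(1/b))^(1/a), evaluated in log space so the
    # inner subtraction never suffers catastrophic cancellation.
    t = torch.log1p(-u) / b                  # log((1 - u)^(1/b)), always <= 0
    log_inner = torch.log(-torch.expm1(t))   # log(1 - (1 - u)^(1/b))
    return torch.exp(log_inner / a)
```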

46Dynamic Negative Guidance of Diffusion Models

[openreview] [pdf]

Abstract Negative Prompting (NP) is widely utilized in diffusion models, particularly in text-to-image applications, to prevent the generation of undesired features. In this paper, we show that conventional NP is limited by the assumption of a constant guidance scale, which may lead to highly suboptimal results, or even complete failure, due to the non-stationarity and state-dependence of the reverse process. Based on this analysis, we derive a principled technique called Dynamic Negative Guidance (DNG), which relies on a near-optimal time- and state-dependent modulation of the guidance without requiring additional training. Unlike NP, negative guidance requires estimating the posterior class probability during the denoising process, which is achieved with limited additional computational overhead by tracking the discrete Markov Chain during the generative process. We evaluate the performance of DNG on class removal on MNIST and CIFAR10, where we show that DNG leads to higher safety, better preservation of class balance, and higher image quality than baseline methods. Furthermore, we show that it is possible to use DNG with Stable Diffusion to obtain more accurate and less invasive guidance than NP.

47One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation

[openreview] [pdf]

Abstract Diffusion models, praised for their success in generative tasks, are increasingly being applied to robotics, demonstrating exceptional performance in behavior cloning. However, their slow generation process stemming from iterative denoising steps poses a challenge for real-time applications in resource-constrained robotics setups and dynamically changing environments. In this paper, we introduce the One-Step Diffusion Policy (OneDP), a novel approach that distills knowledge from pre-trained diffusion policies into a single-step action generator, significantly accelerating response times for robotic control tasks. We ensure the distilled generator closely aligns with the original policy distribution by minimizing the Kullback-Leibler (KL) divergence along the diffusion chain, requiring only 2%-10% additional pre-training cost for convergence. We evaluated OneDP on 6 challenging simulation tasks as well as 4 self-designed real-world tasks using the Franka robot. The results demonstrate that OneDP not only achieves state-of-the-art success rates but also delivers an order-of-magnitude improvement in inference speed, boosting action prediction frequency from 1.5 Hz to 62 Hz, establishing its potential for dynamic and computationally constrained robotic applications. A video demo is provided at https://drive.google.com/file/d/1eIa11gw6DwYKG9CKERy41bjE1ruklRtT/view?usp=sharing, and the code will be publicly available soon.

48Dynamic Diffusion Transformer

[openreview] [pdf]

Abstract Diffusion Transformer (DiT), an emerging diffusion model for image generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To address this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. Extensive experiments on various datasets and different-sized models verify the superiority of DyDiT. Notably, with <3% additional fine-tuning iterations, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73×, and achieves a competitive FID score of 2.07 on ImageNet.

49Data Unlearning in Diffusion Models

[openreview] [pdf]

Abstract Recent work has shown that diffusion models memorize and reproduce training data examples. At the same time, large copyright lawsuits and legislation such as GDPR have highlighted the need for erasing datapoints from diffusion models. However, retraining from scratch is often too expensive. This motivates the setting of data unlearning, i.e., the study of efficient techniques for unlearning specific datapoints from the training set. Existing concept unlearning techniques require an anchor prompt/class/distribution to guide unlearning, which is not available in the data unlearning setting. General-purpose machine unlearning techniques were found to be either unstable or ineffective at unlearning data. We therefore propose a family of new loss functions called Subtracted Importance Sampled Scores (SISS) that utilize importance sampling and are the first method to unlearn data with theoretical guarantees. SISS is constructed as a weighted combination between simpler objectives that are responsible for preserving model quality and unlearning the targeted datapoints. When evaluated on CelebA-HQ and MNIST, SISS achieved Pareto optimality along the quality and unlearning strength dimensions. On Stable Diffusion, SISS successfully mitigated memorization on nearly 90% of the prompts we tested. We release our code online.

50Towards a Theoretical Understanding of Memorization in Diffusion Models

[openreview] [pdf]

Abstract As diffusion probabilistic models (DPMs) are being employed as mainstream models for Generative Artificial Intelligence (GenAI), the study of their memorization of training data has attracted growing attention. Existing works in this direction aim to establish an understanding of whether or to what extent DPMs learn via memorization. Such an understanding is crucial for identifying potential risks of data leakage and copyright infringement in diffusion models and, more importantly, for trustworthy application of GenAI. Existing works revealed that conditional DPMs are more prone to training data memorization than unconditional DPMs, and the data extraction methods motivated by these findings mostly target conditional DPMs. However, these understandings are primarily empirical, and extracting training data from unconditional models has been found to be extremely challenging. In this work, we provide a theoretical understanding of memorization in both conditional and unconditional DPMs under the assumption of model convergence. Our theoretical analysis indicates that extracting data from unconditional models can also be effective by constructing a proper surrogate condition. Based on this result, we propose a novel data extraction method named Surrogate condItional Data Extraction (SIDE) that leverages a time-dependent classifier trained on the generated data as a surrogate condition to extract training data from unconditional DPMs. Empirical results demonstrate that our SIDE can extract training data in challenging scenarios where previous methods fail, and it is, on average, over 50% more effective across different scales of the CelebA dataset.

51RETHINK MAXIMUM STATE ENTROPY

[openreview] [pdf]

Abstract In the absence of specific tasks or extrinsic reward signals, a key objective for an agent is the efficient exploration of its environment. A widely adopted strategy to achieve this is maximizing state entropy, which encourages the agent to uniformly explore the entire state space. Most existing approaches for maximum state entropy (MaxEnt) are rooted in two foundational approaches, which were proposed by Hazan and Liu & Abbeel, respectively. However, a unified perspective on these methods is lacking within the community. In this paper, we analyze these two foundational approaches within a unified framework and demonstrate that both methods share the same reward function when employing the kNN density estimator. We also show that the η-based policy sampling method proposed by Hazan is unnecessary and that the primary distinction between the two lies in the frequency with which the locally stationary reward function is updated. Building on this analysis, we introduce MaxEnt-(V)eritas, which combines the most effective components of both methods: iteratively updating the reward function as defined by Liu & Abbeel, and training the agent until convergence before updating the reward functions, akin to the procedure used by Hazan. We prove that MaxEnt-V is an efficient ε-optimal algorithm for maximizing state entropy, where the tolerance ε decreases as the number of iterations increases. Empirical validation in three Mujoco environments shows that MaxEnt-Veritas significantly outperforms the two MaxEnt frameworks in terms of both state coverage and state entropy maximization, with sound explanations for these results.
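Under the kNN density estimator, the shared reward the abstract mentions is, up to constants, the log distance to a state's k-th nearest visited neighbor. A minimal sketch of such an intrinsic reward (a standard construction, not the paper's code):

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_state_entropy_reward(states, k=5):
    # Log distance to each state's k-th nearest neighbor among visited
    # states: larger in sparsely visited regions, so maximizing it pushes
    # the agent toward uniform coverage of the state space.
    tree = cKDTree(states)
    d, _ = tree.query(states, k=k + 1)  # column 0 is the point itself
    return np.log(d[:, -1] + 1e-8)
```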

52Conditional Information Bottleneck Approach for Out-of-Distribution Sequential Recommendation

[openreview] [pdf]

Abstract Sequential recommendation (SR) aims to suggest items users are most likely to engage with next based on their past interactions. However, in practice, SR systems often face the out-of-distribution (OOD) problem due to dynamic environmental factors (e.g., seasonal changes), leading to significant performance degradation in the testing phase. Some methods incorporate distributionally robust optimization (DRO) into SR to alleviate OOD, but the sparsity of SR data challenges this. Other approaches use random data augmentations to explore the OOD, potentially distorting important information, as user behavior is personalized rather than random. Additionally, they often overlook users’ varying sensitivity to distribution shifts during the exploration, which is crucial for capturing the evolution of user preferences in OOD contexts. In this work, inspired by information bottleneck theory (IB), we propose the Conditional Distribution Information Bottleneck (CDIB), a novel objective that creates diverse OOD distributions while preserving minimal sufficient information regarding the origin distribution conditioned on the user. Building on this, we introduce a framework with a learnable, personalized data augmentation method using a mask-then-generate paradigm to craft diverse and reliable OOD distributions optimized with CDIB. Experiments on four real-world datasets show our model consistently outperforms baselines. The code is available at https://anonymous.4open.science/r/CDIB-51C8.

53Backtracking Improves Generation Safety

[openreview] [pdf]

Abstract Text generation has a fundamental limitation almost by definition: there is no taking back tokens that have been generated, even when they are clearly problematic. In the context of language model safety, when a partial unsafe generation is produced, language models by their nature tend to happily keep on generating similarly unsafe additional text. This is in fact how safety alignment of frontier models gets circumvented in the wild, despite great efforts in improving their safety. Deviating from the paradigm of approaching safety alignment as prevention (decreasing the probability of harmful responses), we propose backtracking, a technique that allows language models to “undo” and recover from their own unsafe generation through the introduction of a special [RESET] token. Our method can be incorporated into either SFT or DPO training to optimize helpfulness and harmlessness. We show that models trained to backtrack are consistently safer than baseline models: backtracking Llama-3-8B is four times safer than the baseline model (6.1% → 1.5%) in our evaluations, without regression in helpfulness. Our method additionally provides protection against four adversarial attacks including an adaptive attack, despite not being trained to do so.
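At inference time, the mechanism is simple to consume: everything before a [RESET] token is a draft the model retracted. A minimal post-processing sketch under that assumption (the paper's exact decoding pipeline may differ):

```python
def extract_final_response(text, reset_token="[RESET]"):
    # Anything preceding the last reset token is an unsafe draft the model
    # chose to retract; only the text after it is the final response.
    return text.split(reset_token)[-1].strip()
```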

54Optimizing Latent Goal by Learning from Trajectory Preference

[openreview] [pdf]

Abstract A growing body of work has emerged focusing on instruction-following policies for open-world agents, aiming to better align the agent’s behavior with human intentions. However, the performance of these policies is highly susceptible to the initial prompt, which leads to extra effort in selecting the best instructions. We propose a framework named Preference Goal Tuning (PGT). PGT allows policies to interact with the environment to collect several trajectories, which are categorized into positive and negative examples based on preference. A preference optimization algorithm is used to fine-tune the initial goal latent representation using the collected trajectories while keeping the policy backbone frozen. Experimental results show that with minimal data and training, PGT achieves average relative improvements of 72.0% and 81.6% over 17 tasks for 2 different foundation policies, and outperforms the best human-selected instructions. Moreover, PGT surpasses full fine-tuning in out-of-distribution (OOD) task-execution environments by 13.4%, indicating that our approach retains strong generalization capabilities. Since our approach stores a single latent representation for each task independently, it can be viewed as an efficient method for Continual Learning, without the risk of catastrophic forgetting or task interference. In short, PGT enhances the performance of agents across nearly all tasks in the Minecraft Skillforge benchmark and demonstrates robustness to the execution environment.

55Domain Guidance: A Simple Transfer Approach for a Pre-trained Diffusion Model

[openreview] [pdf]

Abstract Recent advancements in diffusion models have revolutionized generative modeling. However, the impressive and vivid outputs they produce often come at the cost of significant model scaling and increased computational demands. Consequently, building personalized diffusion models based on off-the-shelf models has emerged as an appealing alternative. In this paper, we introduce a novel perspective on conditional generation for transferring a pre-trained model. From this viewpoint, we propose Domain Guidance, a straightforward transfer approach that leverages pre-trained knowledge to guide the sampling process toward the target domain. Domain Guidance shares a formulation similar to advanced classifier-free guidance, facilitating better domain alignment and higher-quality generations. We provide both empirical and theoretical analyses of the mechanisms behind Domain Guidance. Our experimental results demonstrate its substantial effectiveness across various transfer benchmarks, achieving over a 19.6% improvement in FID and a 20.6% improvement in FD$_\text{DINOv2}$ compared to standard fine-tuning. Notably, existing fine-tuned models can seamlessly integrate Domain Guidance to leverage these benefits, without additional training.
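One plausible reading of "a formulation similar to classifier-free guidance" is to let the pre-trained model play the unconditional branch and the fine-tuned model the conditional one. A hedged sketch of that combination (not the paper's verified equation):

```python
def domain_guidance_eps(eps_pretrained, eps_finetuned, w):
    # CFG-style combination that pushes samples toward the fine-tuned
    # (target) domain; w = 1 recovers plain fine-tuned sampling, and
    # w > 1 strengthens the pull toward the target domain.
    return eps_pretrained + w * (eps_finetuned - eps_pretrained)
```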

56Distilled Diffusion Language Models

[openreview] [pdf]

Abstract Transformer-based Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their autoregressive nature forces sequential token-by-token decoding, leading to inefficiencies during inference. Furthermore, autoregressive language models lack inherent self-correction abilities, which hinders their capacity to refine and improve generated content without relying on external prompting or retraining techniques. In contrast, diffusion-based models offer the advantage of fast parallel generation through iterative refinement, while leveraging bi-directional attention to utilize full context at once. However, diffusion models have yet to match the performance of their autoregressive counterparts. This motivates us to explore the possibility of distilling a pre-trained autoregressive (AR) language model (teacher) into a non-autoregressive diffusion (non-AR) language model (student), combining the best of both worlds. In this work, we present Target Concrete Score (TCS) distillation, a theoretically grounded framework that bridges autoregressive and diffusion paradigms. TCS distillation is broadly applicable to both discrete and continuous diffusion models, with any pre-trained autoregressive teacher model. We propose techniques to make TCS distillation scalable and efficient for transformer-based models, and show how it can both improve pre-trained diffusion language models and also train new models from scratch. Through comprehensive experiments on language modeling tasks, we demonstrate the effectiveness of our proposed methods.

57Revamping Diffusion Guidance for Conditional and Unconditional Generation

[openreview] [pdf]

Abstract Classifier-free guidance (CFG) has become the standard method for enhancing the quality of conditional diffusion models. However, employing CFG requires either training an unconditional model alongside the main diffusion model or modifying the training procedure by periodically inserting a null condition. There is also no clear extension of CFG to unconditional models. In this paper, we revisit the core principles of CFG and introduce a new method, independent condition guidance (ICG), which provides the benefits of CFG without the need for any special training procedures. Our approach streamlines the training process of conditional diffusion models and can also be applied during inference on any pre-trained conditional model. Additionally, by leveraging the time-step information encoded in all diffusion networks, we propose an extension of CFG, called time-step guidance (TSG), which can be applied to any diffusion model, including unconditional ones. Our guidance techniques are easy to implement and have the same sampling cost as CFG. Through extensive experiments, we demonstrate that ICG matches the performance of standard CFG across various conditional diffusion models. Moreover, we show that TSG improves generation quality in a manner similar to CFG, without relying on any conditional information.
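Based only on the abstract, a natural way to obtain CFG's benefits without a learned null condition is to substitute a condition drawn independently of the current sample for the unconditional branch. A hedged sketch of that idea (the paper's exact ICG rule may differ):

```python
def icg_eps(model, x, t, cond, indep_cond, w):
    # indep_cond: a condition sampled independently of x (e.g., a randomly
    # drawn label or prompt), standing in for the null condition of CFG.
    eps_c = model(x, t, cond)
    eps_i = model(x, t, indep_cond)
    return eps_i + w * (eps_c - eps_i)
```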

58Task-agnostic Pre-training and Task-guided Fine-tuning for Versatile Diffusion Planner

[openreview] [pdf]

Abstract Diffusion models have demonstrated their capabilities in modeling multi-task trajectories. However, existing multi-task planners or policies typically rely on task-specific demonstrations via multi-task imitation, or require task-specific reward labels to facilitate policy optimization via Reinforcement Learning (RL). They heavily rely on task-specific labeled data, which can be difficult to acquire. To address these challenges, we aim to develop a versatile diffusion planner that can leverage large-scale inferior data containing task-agnostic sub-optimal trajectories, with the ability to adapt quickly to specific tasks. In this paper, we propose SODP, a two-stage framework that leverages Sub-Optimal data to learn a Diffusion Planner, which is generalizable to various downstream tasks. Specifically, in the pre-training stage, we train a foundation diffusion planner that extracts general planning capabilities by modeling the versatile distribution of multi-task trajectories, which may be sub-optimal but provide wide data coverage. Then, for downstream tasks, we adopt RL-based fine-tuning with task-specific rewards to quickly refine the diffusion planner, aiming to generate action sequences with higher task-specific returns. Experimental results from multi-task domains including Meta-World and Adroit demonstrate that SODP outperforms state-of-the-art methods with only a small amount of data for reward-guided fine-tuning.

59Can the Training Loss be Predictive for Out-of-Distribution Generalization?

[openreview] [pdf]

Abstract Traditional model selection in deep learning relies on carefully tuning several hyper-parameters (HPs) controlling regularization strength on held-out validation data, which can be challenging to obtain in scarce-data scenarios or may not accurately reflect real-world deployment conditions due to distribution shifts. Motivated by such issues, this paper investigates the potential of using solely the training loss to predict the generalization performance of neural networks on out-of-distribution (OOD) test scenarios. Our analysis reveals that preserving consistent prediction variance across training and testing distributions is essential for establishing a correlation between training loss and OOD generalization. We propose architectural adjustments to ensure variance preservation, enabling reliable model selection based on training loss alone, even in over-parameterized settings with a sample-to-parameter ratio exceeding four orders of magnitude. We extensively assess the model-selection capabilities of variance-preserving architectures on several scarce data, domain-shift, and corruption benchmarks by optimizing HPs such as learning rate, weight decay, batch size, and data augmentation strength.

60Balancing Domain-Invariant and Domain-Specific Knowledge for Domain Generalization with Online Knowledge Distillation

[openreview] [pdf]

Abstract Deep learning models often experience performance degradation when the distribution of testing data differs from that of training data. Domain generalization addresses this problem by leveraging knowledge from multiple source domains to enhance model generalizability. Recent studies have shown that distilling knowledge from large pretrained models effectively improves a model’s ability to generalize to unseen domains. However, current knowledge distillation-based domain generalization approaches overlook the importance of domain-specific knowledge and rely on a two-stage training process, which limits the effectiveness of knowledge transfer. To overcome these limitations, we propose the Balanced Online knowLedge Distillation (BOLD) framework for domain generalization. BOLD employs a multi-domain expert teacher model, with each expert specializing in specific source domains to preserve domain-specific knowledge. This approach enables the student to distill both domain-invariant and domain-specific knowledge from the teacher. Additionally, BOLD adopts an online knowledge distillation strategy where the teacher and students learn simultaneously, allowing the teacher to adapt based on the student’s feedback, thereby enhancing knowledge transfer and improving the student’s generalizability. Extensive experiments conducted with state-of-the-art baselines on seven domain generalization benchmarks demonstrate the effectiveness of the BOLD framework. We also provide a theoretical analysis that underscores the effectiveness of domain-specific knowledge and the online knowledge distillation strategy in domain generalization.

61Heavy-Tailed Diffusion Models

[openreview] [pdf]

Abstract Diffusion models achieve state-of-the-art generation quality across many applications, but their ability to capture rare or extreme events in heavy-tailed distributions remains unclear. In this work, we show that traditional diffusion and flow-matching models with standard Gaussian priors fail to accurately capture heavy-tailed behavior. We address this by repurposing the diffusion framework for heavy-tail estimation using multivariate Student-t distributions. We develop a tailored perturbation kernel and derive the denoising posterior based on the conditional Student-t distribution for the backward process. Inspired by γ-divergence for heavy-tailed distributions, we derive a training objective for heavy-tailed denoisers. The resulting framework introduces controllable tail generation using only a single scalar hyperparameter, making it easily tunable for diverse real-world distributions. As specific instantiations of our framework, we introduce t-EDM and t-Flow, extensions of existing diffusion and flow models that employ a Student-t prior. Remarkably, our approach is readily compatible with standard Gaussian diffusion models and requires only minimal code changes. Empirically, we show that our t-EDM and t-Flow outperform standard diffusion models in heavy-tail estimation on high-resolution weather datasets in which generating rare and extreme events is crucial.
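Swapping the Gaussian prior for a Student-t one only requires changing how noise is drawn. A minimal sketch using the standard Gaussian scale-mixture construction (a sketch of the heavy-tailed prior only, not the paper's full perturbation kernel):

```python
import torch

def student_t_noise(batch, dim, nu):
    # Multivariate Student-t via a Gaussian scale mixture:
    # z = g / sqrt(k / nu), with g ~ N(0, I) and k ~ Chi2(nu), one k per
    # sample. Tails grow heavier as nu decreases; nu -> inf is Gaussian.
    g = torch.randn(batch, dim)
    k = torch.distributions.Chi2(torch.tensor(float(nu))).sample((batch, 1))
    return g / torch.sqrt(k / nu)
```

The single scalar nu is the kind of "single scalar hyperparameter" knob the abstract describes for controlling tail heaviness.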

62Variational Search Distributions

[openreview] [pdf]

Abstract We develop variational search distributions (VSD), a method for finding discrete, combinatorial designs of a rare desired class in a batch sequential manner with a fixed experimental budget. We formalize the requirements and desiderata for this problem and formulate a solution via variational inference. In particular, VSD uses off-the-shelf gradient based optimization routines, can learn powerful generative models for designs, and can take advantage of scalable predictive models. We derive asymptotic convergence rates for learning the true conditional generative distribution of designs with certain configurations of our method. After illustrating the generative model on images, we empirically demonstrate that VSD can outperform existing baseline methods on a set of real sequence-design problems in various biological systems.

63Diffusion Modulation via Environment Mechanism Modeling for Planning

[openreview] [pdf]

Abstract Diffusion models have shown promising capabilities in trajectory generation for planning in offline reinforcement learning (RL). However, conventional diffusion-based planning methods often fail to account for the fact that generating trajectories in RL requires unique consistency between transitions to ensure coherence in real environments. This oversight can result in considerable discrepancies between the generated trajectories and the underlying mechanisms of a real environment. To address this problem, we propose a novel diffusion-based planning method, termed as Diffusion Modulation via Environment Mechanism Modeling (DMEMM). DMEMM modulates diffusion model training by incorporating key RL environment mechanisms, particularly transition dynamics and reward functions. Experimental results demonstrate that DMEMM achieves state-of-the-art performance for planning with offline reinforcement learning.

64Convergence of Score-Based Discrete Diffusion Models: A Discrete-Time Analysis

[openreview] [pdf]

Abstract Diffusion models have achieved great success in generating high-dimensional samples across various applications. While the theoretical guarantees for continuous-state diffusion models have been extensively studied, the convergence analysis of the discrete-state counterparts remains under-explored. In this paper, we study the theoretical aspects of score-based discrete diffusion models under the Continuous Time Markov Chain (CTMC) framework. We introduce a discrete-time sampling algorithm in the general state space [S]^d that utilizes score estimators at predefined time points. We derive convergence bounds for the Kullback-Leibler (KL) divergence and total variation (TV) distance between the generated sample distribution and the data distribution, considering both scenarios with and without early stopping under specific assumptions. Notably, our KL divergence bounds are nearly linear in the dimension d, aligning with state-of-the-art results for diffusion models. Our convergence analysis employs a Girsanov-based method and establishes key properties of the discrete score function, which are essential for characterizing the discrete-time sampling process.

65Energy-Based Conceptual Diffusion Model

[openreview] [pdf]

Abstract Diffusion models have shown impressive sample generation capabilities across various domains. However, current methods are still lacking in human-understandable explanations and interpretable control: (1) they do not provide a probabilistic framework for systematic interpretation. For example, when tasked with generating an image of a “Nighthawk”, they cannot quantify the probability of specific concepts (e.g., “black bill” and “brown crown” usually seen in Nighthawks) or verify whether the generated concepts align with the instruction. This limits explanations of the generative process; (2) they do not naturally support control mechanisms based on concept probabilities, such as correcting errors (e.g., correcting “black crown” to “brown crown” in a generated “Nighthawk” image) or performing imputations using these concepts, therefore falling short in interpretable editing capabilities. To address these limitations, we propose Energy-based Conceptual Diffusion Models (ECDMs). ECDMs integrate diffusion models and Concept Bottleneck Models (CBMs) within the framework of Energy-Based Models to provide unified interpretations. Unlike conventional CBMs, which are typically discriminative, our approach extends CBMs to the generative process. ECDMs use a set of energy networks and pretrained diffusion models to define the joint energy estimation of the input instructions, concept vectors, and generated images. This unified framework enables concept-based generation, interpretation, debugging, intervention, and imputation through conditional probabilities derived from energy estimates. Our experiments on various real-world datasets demonstrate that ECDMs offer both strong generative performance and rich concept-based interpretability.

66Diversity-Rewarded CFG Distillation

[openreview] [pdf]

Abstract Generative models are transforming creative domains such as music generation, with inference-time strategies like Classifier-Free Guidance (CFG) playing a crucial role. However, CFG doubles inference cost while limiting originality and diversity across generated contents. In this paper, we introduce diversity-rewarded CFG distillation, a novel finetuning procedure that distills the strengths of CFG while addressing its limitations. Our approach optimises two training objectives: (1) a distillation objective, encouraging the model alone (without CFG) to imitate the CFG-augmented predictions, and (2) an RL objective with a diversity reward, promoting the generation of diverse outputs for a given prompt. By finetuning, we learn model weights with the ability to generate high-quality and diverse outputs, without any inference overhead. This also unlocks the potential of weight-based model merging strategies: by interpolating between the weights of two models (the first focusing on quality, the second on diversity), we can control the quality-diversity trade-off at deployment time, and even further boost performance. We conduct extensive experiments on the MusicLM text-to-music generative model, where our approach surpasses CFG in terms of quality-diversity Pareto optimality. According to human evaluators, our finetuned-then-merged model generates samples with higher quality-diversity than the base model augmented with CFG. Explore our generations at https://musicdiversity.github.io/.
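The deployment-time merging the abstract describes is plain weight interpolation. A minimal sketch over two checkpoints' state dicts (names are hypothetical):

```python
def merge_checkpoints(sd_quality, sd_diversity, lam):
    # Linear interpolation between a quality-focused and a diversity-focused
    # checkpoint; lam in [0, 1] sets the quality-diversity trade-off
    # at deployment time without any retraining.
    return {k: (1 - lam) * sd_quality[k] + lam * sd_diversity[k]
            for k in sd_quality}
```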

67EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing

[openreview] [pdf]

Abstract Diffusion transformers have been widely adopted for text-to-image synthesis. While scaling these models up to billions of parameters shows promise, the effectiveness of scaling beyond current sizes remains underexplored and challenging. By explicitly exploiting the computational heterogeneity of image generations, we develop a new family of Mixture-of-Experts (MoE) models (EC-DIT) for diffusion transformers with expert-choice routing. EC-DIT learns to adaptively optimize the compute allocated to understand the input texts and generate the respective image patches, enabling heterogeneous computation aligned with varying text-image complexities. This heterogeneity provides an efficient way of scaling EC-DIT up to 97 billion parameters and achieving significant improvements in training convergence, text-to-image alignment, and overall generation quality over dense models and conventional MoE models. Through extensive ablations, we show that EC-DIT demonstrates superior scalability and adaptive compute allocation by recognizing varying textual importance through end-to-end training. Notably, in text-to-image alignment evaluation, our largest models achieve a state-of-the-art GenEval score of 71.68% and still maintain competitive inference speed with intuitive interpretability.

68Diffusion-Based Planning for Autonomous Driving with Flexible Guidance

[openreview] [pdf]

Abstract Achieving human-like driving behaviors in complex open-world environments is a critical challenge in autonomous driving. Contemporary learning-based planning approaches such as imitation learning methods often struggle to balance competing objectives and lack safety assurance, due to limited adaptability and inadequacy in learning complex multi-modal behaviors commonly exhibited in human planning, not to mention their strong reliance on fallback strategies with predefined rules. We propose a novel transformer-based Diffusion Planner for closed-loop planning, which can effectively model multi-modal driving behavior and ensure trajectory quality without any rule-based refinement. Our model supports joint modeling of both prediction and planning tasks under the same architecture, enabling cooperative behaviors between vehicles. Moreover, by learning the gradient of the trajectory score function and employing a flexible classifier guidance mechanism, Diffusion Planner effectively achieves safe and adaptable planning behaviors. Evaluations on the large-scale real-world autonomous planning benchmark nuPlan and our newly collected 200-hour delivery-vehicle driving dataset demonstrate that Diffusion Planner achieves state-of-the-art closed-loop performance with robust transferability in diverse driving styles.

69Satisficing Exploration in Bandit Optimization

[openreview] [pdf]

Abstract Motivated by the concept of satisficing in decision-making, we consider the problem of satisficing exploration in bandit optimization. In this setting, the learner aims at finding a satisficing arm whose mean reward exceeds a certain threshold. The performance is measured by satisficing regret, which is the cumulative deficit of the chosen arm’s mean reward compared to the threshold. We propose SELECT, a general algorithmic template for Satisficing Exploration via LowEr Confidence bound Testing, that attains constant satisficing regret for a wide variety of bandit optimization problems in the realizable case (i.e., whenever a satisficing arm exists). Specifically, given a class of bandit optimization problems and a corresponding learning oracle with sub-linear (standard) regret upper bound, SELECT iteratively makes use of the oracle to identify a potential satisficing arm. Then, it collects data samples from this arm, and continuously compares the lower confidence bound of the identified arm’s mean reward against the threshold value to determine if it is a satisficing arm. As a complement, SELECT also enjoys the same (standard) regret guarantee as the oracle in the non-realizable case. Finally, we conduct numerical experiments to validate the performance of SELECT for several popular bandit optimization settings.

70Longitudinal Latent Diffusion Models

[openreview] [pdf]

Abstract Longitudinal data are crucial in several fields, but collecting them is a challenging process, often hindered by concerns such as individual privacy. Extrapolating initial trajectories in time or generating fully synthetic sequences could address these issues and prove valuable in clinical trials, drug design, and even public policy evaluation. We propose a generative statistical model for longitudinal data that links the temporal dependence of a sequence to a latent diffusion model and leverages the geometry of the autoencoder latent space. This versatile method can be used for several tasks - prediction, generation, oversampling - effectively handling high-dimensional data such as images and irregularly-measured sequences, needing only relatively few training samples. Thanks to its ability to generate sequences with controlled variability, it outperforms previously proposed methods on datasets of varying complexity, while remaining interpretable.

71Understanding and Mitigating Memorization in Diffusion Models for Tabular Data

[openreview] [pdf]

Abstract Tabular data generation has attracted significant research interest in recent years, with tabular diffusion models greatly improving the quality of synthetic data. However, while memorization—where models inadvertently replicate exact or near-identical training data—has been thoroughly investigated in image and text generation, its effects on tabular data remain largely unexplored. In this paper, we conduct the first comprehensive investigation of memorization phenomena in diffusion models for tabular data. Our empirical analysis reveals that memorization appears in tabular diffusion models and increases with the number of training epochs. We further examine the influence of factors such as dataset sizes, feature dimensions, and different diffusion models on memorization. Additionally, we provide a theoretical explanation for why memorization occurs in tabular diffusion models. To address this issue, we propose TabCutMix, a simple yet effective data augmentation technique that exchanges randomly selected feature segments between random training sample pairs. Experimental results across various datasets and diffusion models demonstrate that TabCutMix effectively mitigates memorization while maintaining high-quality data generation. Our code is available at https://anonymous.4open.science/r/TabCutMix-3F7B.
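The augmentation as described is a one-function transform. A minimal sketch, under the assumptions that X is a numeric (n, d) feature matrix and a fixed fraction of columns is swapped per pair (the swap fraction and pairing scheme are illustrative):

```python
import numpy as np

def tabcutmix(X, frac=0.3, seed=None):
    # Exchange a random subset of feature columns between each sample and a
    # randomly chosen partner row, so no synthetic-training target is an
    # exact copy of any single original row.
    rng = np.random.default_rng(seed)
    src, out = X.copy(), X.copy()
    n, d = X.shape
    partners = rng.permutation(n)
    for i in range(n):
        cols = rng.choice(d, size=max(1, int(frac * d)), replace=False)
        out[i, cols] = src[partners[i], cols]
    return out
```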

72Principal Counterfactual Fairness

[openreview] [pdf]

Abstract Fairness in human and algorithmic decision-making is crucial in areas such as criminal justice, education, and social welfare. Recently, counterfactual fairness has drawn increasing research interest, suggesting that decision-making for individuals should remain the same when intervening with different values on the protected attributes. Nevertheless, the question of “which attributes and individuals should be protected” is rarely discussed in the existing counterfactual fairness literature. For example, when considering leg disability as a protected attribute, algorithms should not treat individuals with leg disabilities differently in college admissions, but one may naturally take this factor into account when selecting runner athletes. In other words, when and how to enforce fairness is expected to depend on the causal relation between the protected attribute and the outcome of interest. Formally, this paper proposes principal counterfactual fairness using the concept of principal stratification from the causal inference literature, focusing on whether an algorithm is counterfactually fair for individuals whose protected attribute has no individual causal effect on the outcome of interest. To examine whether an algorithm satisfies principal counterfactual fairness, we derive statistical bounds and propose a post-processing approach to achieve principal counterfactual fairness with minimal individual decision changes. Experiments are conducted using synthetic and real-world datasets to verify the effectiveness of our methods.

73Unified Convergence Analysis for Score-Based Diffusion Models with Deterministic Samplers

[openreview] [pdf]

Abstract Score-based diffusion models have emerged as powerful techniques for generating samples from high-dimensional data distributions. These models involve a two-phase process: first, injecting noise to transform the data distribution into a known prior distribution, and second, sampling to recover the original data distribution from noises. Among the various sampling methods, deterministic samplers stand out for their enhanced efficiency. However, analyzing these deterministic samplers presents unique challenges, as they preclude the use of established techniques such as Girsanov’s theorem, which are only applicable to stochastic samplers. Furthermore, existing analysis for deterministic samplers usually focuses on specific examples, lacking a generalized approach for general forward processes and various deterministic samplers. Our paper addresses these limitations by introducing a unified convergence analysis framework. To demonstrate the power of our framework, we analyze the variance-preserving (VP) forward process with the exponential integrator (EI) scheme, achieving an iteration complexity of $\tilde{O}(d^2/\epsilon)$. Additionally, we provide a detailed analysis of DDIM-type samplers, which have been underexplored in previous research, achieving polynomial iteration complexity.

74Improving Discrete Diffusion with Schedule-Conditioning

[openreview] [pdf]

Abstract In research on discrete diffusion generative models, one long-standing mystery is the dominance of the masking state corruption process. In masking diffusion, all data points collapse to a sequence of mask tokens without any transitions between non-mask tokens, ruling out small edits from one unmasked token to another. By contrast, in image modeling, the dominant corruption process is Gaussian noise, which encourages gradual movements in pixel space. In this paper, we propose that masking diffusion dominates due to knowledge of when corruptions occurred. When it makes predictions, it does so conditional on the schedule of previous corruptions; this allows it to devote less capacity to inferring whether a corruption has occurred and more capacity to modeling relationships between tokens. We use this insight to build knowledge of corruptions into other discrete diffusion models; we call our method schedule-conditioned diffusion (SCUD). We show that SCUD generalizes classical discrete diffusion and masking diffusion. We show that applying SCUD to models with different corruption processes leads to improved perplexities on images, text, and protein sequences. Finally, by applying SCUD to models with corruption processes with “gradual” structure, we build diffusion models that outperform masking.

75Preference Diffusion for Recommendation

[openreview] [pdf]

Abstract Recommender systems predict personalized item rankings based on user preference distributions derived from historical behavior data. Recently, diffusion models (DMs) have gained attention in recommendation for their ability to model complex distributions, yet current DM-based recommenders often rely on traditional objectives like mean squared error (MSE) or recommendation objectives, which are not optimized for personalized ranking tasks or fail to fully leverage DMs’ generative potential. To address this, we propose PreferDiff, a tailored optimization objective for DM-based recommenders. PreferDiff transforms BPR into a log-likelihood ranking objective and integrates multiple negative samples to better capture user preferences. Specifically, we employ variational inference to handle the intractability by minimizing the variational upper bound, and we replace MSE with cosine error to improve alignment with recommendation tasks. Finally, we balance learning generation and preference to enhance the training stability of DMs. PreferDiff offers three key benefits: it is the first personalized ranking loss designed specifically for DM-based recommenders; it improves ranking and achieves faster convergence by addressing hard negatives; and, as we prove, it is theoretically connected to Direct Preference Optimization, indicating its potential to align user preferences in DM-based recommenders via generative modeling. Extensive experiments across three benchmarks validate its superior recommendation performance and commendable general sequential recommendation capabilities. Our codes are available at https://anonymous.4open.science/r/PreferDiff.

76Mitigating Shortcut Learning with Diffusion Counterfactuals and Diverse Ensembles

[openreview] [pdf]

Abstract Spurious correlations in the data, where multiple cues are predictive of the target labels, often lead to a phenomenon known as shortcut learning, where a model relies on erroneous, easy-to-learn cues while ignoring reliable ones. In this work, we propose DiffDiv, an ensemble diversification framework exploiting Diffusion Probabilistic Models (DPMs) to mitigate this form of bias. We show that at particular training intervals, DPMs can generate images with novel feature combinations, even when trained on samples displaying correlated input features. We leverage this crucial property to generate synthetic counterfactuals to increase model diversity via ensemble disagreement. We show that DPM-guided diversification is sufficient to remove dependence on shortcut cues, without a need for additional supervised signals. We further empirically quantify its efficacy on several diversification objectives, and finally show improved generalization and diversification on par with prior work that relies on auxiliary data collection.

77HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting?

[openreview] [pdf]

Abstract Accurately forecasting multiple future events within a given time horizon is crucial for applications in finance, retail, social networks, and healthcare. Event timing and labels are typically modeled using Marked Temporal Point Processes (MTPP), with evaluations often focused on next-event prediction quality. While some studies have extended evaluations to a fixed number of future events, we demonstrate that this approach leads to inaccuracies in handling false positives and false negatives. To address these issues, we propose a novel evaluation method inspired by object detection techniques from computer vision. Specifically, we introduce Temporal mean Average Precision (T-mAP), a temporal variant of mAP, which overcomes the limitations of existing long-horizon evaluation metrics. Our extensive experiments demonstrate that models with strong next-event prediction accuracy can yield poor long-horizon forecasts, and vice versa, indicating that specialized methods are needed for each task. To support further research, we release HoTPP, the first benchmark specifically designed for evaluating long-horizon MTPP predictions. HoTPP includes large-scale datasets with up to 43 million events and provides optimized procedures for both autoregressive and parallel inference, paving the way for future advancements in the field.

78Zigzag Diffusion Sampling: The Path to Success Is Zigzag

[openreview] [pdf]

Abstract Diffusion models, the most popular generative paradigm so far, can inject conditional information into the generation path to guide the latent towards desired directions. However, existing text-to-image diffusion models often fail to maintain high image quality and high prompt-image alignment for challenging prompts. To mitigate this issue and enhance existing pretrained diffusion models, we make three main contributions in this paper. First, we theoretically and empirically demonstrate that the conditional guidance gap between the denoising and inversion processes captures prompt-related semantic information. Second, motivated by this theoretical analysis, we derive Zigzag Diffusion Sampling (Z-Sampling), a novel sampling method that leverages the guidance gap to accumulate semantic information step by step throughout the entire generation process, leading to improved sampling results. Moreover, as a plug-and-play method, Z-Sampling can be generally applied to various diffusion models (e.g., accelerated ones and Transformer-based ones) with very limited coding costs. Third, extensive experiments demonstrate that Z-Sampling can generally and significantly enhance generation quality across various benchmark datasets, diffusion models, and performance evaluation metrics. Particularly, Z-Sampling is good at handling challenging fine-grained prompts, such as style, position, counting, and multiple objects, due to its guidance-gap-based information gain. Moreover, Z-Sampling can even further enhance existing diffusion models combined with other orthogonal methods, including Diffusion-DPO.
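A minimal sketch of one zigzag iteration, assuming hypothetical denoise and invert callables (e.g., a guided DDIM step and its inversion): the latent is denoised, inverted back to the same timestep, and denoised again, so each round trip accumulates the prompt-related semantics carried by the guidance gap.

```python
def zigzag_step(z_t, t, denoise, invert):
    # Forward-backward-forward: if the inversion uses weaker (or no)
    # guidance, the round trip injects the conditional-guidance gap into
    # the latent before the next denoising pass.
    z_prev = denoise(z_t, t)        # t -> t-1 with the desired condition
    z_back = invert(z_prev, t - 1)  # t-1 -> t
    return denoise(z_back, t)
```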

79RTDiff: Reverse Trajectory Synthesis via Diffusion for Offline Reinforcement Learning

[openreview] [pdf]

Abstract In offline reinforcement learning (RL), managing the distribution shift between the learned policy and the static offline dataset is a persistent challenge that can result in overestimated values and suboptimal policies. Traditional offline RL methods address this by introducing conservative biases that limit exploration to well-understood regions, but they often overly restrict the agent’s generalization capabilities. Recent work has sought to generate trajectories using generative models to augment the offline dataset, yet these methods still struggle with overestimating synthesized data, especially when out-of-distribution samples are produced. To overcome this issue, we propose RTDiff, a novel diffusion-based data augmentation technique that synthesizes trajectories in reverse, moving from unknown to known states. Such reverse generation naturally mitigates the risk of overestimation by ensuring that the agent avoids planning through unknown states. Additionally, reverse trajectory synthesis allows us to generate longer, more informative trajectories that take full advantage of diffusion models’ generative strengths while ensuring reliability. We further enhance RTDiff by introducing flexible trajectory length control and improving the efficiency of the generation process through noise management. Our empirical results show that RTDiff significantly improves the performance of several state-of-the-art offline RL algorithms across diverse environments, achieving consistent and superior results by effectively overcoming distribution shift.

80Revealing the Unseen: Guiding Personalized Diffusion Models to Expose Training Data

[openreview] [pdf]

Abstract Diffusion Models (DMs) have evolved into advanced image generation tools, especially for few-shot fine-tuning where a pretrained DM is fine-tuned on a small set of images to capture specific styles or objects. Many people upload these personalized checkpoints online, fostering communities such as Civitai and HuggingFace. However, model owners may overlook the potential risks of data leakage by releasing their fine-tuned checkpoints. Moreover, concerns regarding copyright violations arise when unauthorized data is used during fine-tuning. In this paper, we ask: “Can training data be extracted from these fine-tuned DMs shared online?” A successful extraction would present not only data leakage threats but also offer tangible evidence of copyright infringement. To answer this, we propose FineXtract, a framework for extracting fine-tuning data. Our method approximates fine-tuning as a gradual shift in the model’s learned distribution, from the original pretrained DM toward the fine-tuning data. By extrapolating the models before and after fine-tuning, we guide the generation toward high-probability regions within the fine-tuned data distribution. We then apply a clustering algorithm to extract the most probable images from those generated using this extrapolated guidance. Experiments on DMs fine-tuned with datasets such as WikiArt, DreamBooth, and real-world checkpoints posted online validate the effectiveness of our method, extracting approximately 20% of fine-tuning data in most cases, significantly surpassing baseline performance. The code is available at an anonymous link.
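
One plausible way to realize "extrapolating the models before and after fine-tuning" for diffusion models is to extrapolate their noise predictions at each sampling step; the rule below is an assumed illustration of that guidance, not necessarily the paper's exact formula.

```python
import torch

@torch.no_grad()
def extrapolated_noise(eps_pretrained, eps_finetuned, w=2.0):
    """Guidance by extrapolation: w = 1 recovers the fine-tuned prediction,
    while w > 1 pushes sampling further toward regions whose likelihood
    increased during fine-tuning (the presumed fine-tuning data)."""
    return eps_pretrained + w * (eps_finetuned - eps_pretrained)
```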

81CONCORD: Concept-informed Diffusion for Dataset Distillation

[openreview] [pdf]

Abstract Dataset distillation has witnessed significant progress in synthesizing small-scale datasets that encapsulate rich information from large-scale original ones. Particularly, methods based on generative priors show promising performance, while maintaining computational efficiency and cross-architecture generalization. However, the generation process lacks explicit controllability for each sample. Previous distillation methods primarily match the real distribution from the perspective of the entire dataset, while overlooking conceptual completeness at the instance level. This oversight can result in missing or incorrectly represented object details and compromised dataset quality. To this end, we propose to incorporate the conceptual understanding of large language models (LLMs) to perform a CONCept-infORmed Diffusion process for dataset distillation, CONCORD for short. Specifically, distinguishable and fine-grained concepts are retrieved based on category labels to explicitly inform the denoising process and refine essential object details. By integrating these concepts, the proposed method significantly enhances both the controllability and interpretability of the distilled image generation, without relying on pre-trained classifiers. We demonstrate the efficacy of CONCORD by achieving state-of-the-art performance on ImageNet-1K and its subsets. It further advances the practical application of dataset distillation methods. The code implementation is attached in the supplementary material.

82Cohesion: Coherence-Based Diffusion for Long-Range Dynamics Forecasting

[openreview] [pdf]

Abstract We recast existing works on probabilistic dynamics forecasting through a unified framework connecting turbulence and diffusion principles: Cohesion. Specifically, we treat the coherent part of nonlinear dynamics as a conditioning prior in a denoising process, which can be efficiently estimated using reduced-order models. This fast generation of long prior sequences allows us to reframe forecasting as trajectory planning, a common task in RL. This reformulation is beneficial because we can perform a single conditional denoising pass for an entire sequence, rather than autoregressively over a long lead time, gaining orders-of-magnitude speedups with little performance loss. Nonetheless, Cohesion supports flexibility through temporal composition, which allows iterations to be performed over smaller subsequences, with autoregression as a special case. To ensure temporal consistency within and between subsequences, we incorporate a model-free, small receptive window via temporal convolution that leverages large NFEs during denoising. Finally, we perform our guidance in a classifier-free manner to handle a broad range of conditioning scenarios for zero-shot forecasts. Our experiments demonstrate that Cohesion outperforms state-of-the-art probabilistic emulators for chaotic systems over long lead times, including Kolmogorov Flow and the Shallow Water Equation. Its low spectral divergence highlights Cohesion’s ability to resolve multi-scale physical structures, even in partially-observed cases, which is essential for long-range, high-fidelity, physically-realistic emulation.

[openreview] [pdf]

Abstract Machine Learning (ML) has advanced Combinatorial Optimization (CO), especially for one of its most intensively studied problems, the Travelling Salesman Problem (TSP). While certain methods demonstrate promising performance, they still fall short compared to mathematical solvers. This study uses TSP as a case study, dissecting established mainstream learning-based solvers to outline a comprehensive design space. It advances a unified modular streamline incorporating existing technologies in both learning and search for transparent ablation, aiming to reassess the role of learning and to discern which parts of existing techniques are genuinely beneficial and which are not. This further leads to an investigation of desirable principles for learning designs and an exploration of concepts guiding method design. We demonstrate the desirability of principles such as joint probability estimation, symmetric solution representation, and online optimization for learning-based designs. Leveraging these findings, we propose enhancements to existing methods to compensate for their missing attributes, thereby advancing performance and enriching the technique library. From a higher viewpoint, we also uncover a performance advantage of non-autoregressive and supervised paradigms over their counterparts. The strategic decoupling and organic recomposition yield a factory of new TSP solvers, where we investigate synergies across various method combinations and pinpoint the optimal design choices to create more powerful ML4TSP solvers, thereby facilitating and offering a reference for future research and engineering endeavors. Source code will be made publicly available.

84Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

[openreview] [pdf]

Abstract Large Language Models (LLMs) have achieved remarkable success at tasks like summarization that involve a single turn of interaction. However, they can still struggle with multi-turn tasks like dialogue that require long-term planning. Previous works on multi-turn dialogue extend single-turn reinforcement learning from human feedback (RLHF) methods to the multi-turn setting by treating all prior dialogue turns as a long context. Such approaches suffer from covariate shift: the conversations in the training set have previous turns generated by some reference policy, which means that low training error may not necessarily correspond to good performance when the learner is actually in the conversation loop. In response, we introduce REgressing the RELative FUture (REFUEL), an efficient policy optimization approach designed to address multi-turn RLHF in LLMs. REFUEL employs a single model to estimate Q-values and trains on self-generated data, addressing the covariate shift issue. REFUEL frames the multi-turn RLHF problem as a sequence of regression tasks on iteratively collected datasets, enabling ease of implementation. Theoretically, we prove that REFUEL can match the performance of any policy covered by the training set. Empirically, we evaluate our algorithm by using Llama-3.1-70B-it to simulate a user in conversation with our model. REFUEL consistently outperforms state-of-the-art methods such as DPO and REBEL across various settings. Furthermore, despite having only 8 billion parameters, Llama-3-8B-it fine-tuned with REFUEL outperforms Llama-3.1-70B-it on long multi-turn dialogues.

85Beyond Predefined Depots: A Dual-Mode Generative DRL Framework for Proactive Depot Generation in Location-Routing Problem

[openreview] [pdf]

Abstract The Location-Routing Problem (LRP), which combines the challenges of facility (depot) locating and vehicle route planning, is critically constrained by the reliance on predefined depot candidates, limiting the solution space and potentially leading to suboptimal outcomes. Previous research on LRP without predefined depots is scant and predominantly relies on heuristic algorithms that iteratively attempt depot placements across a planar area. Such approaches lack the ability to proactively generate depot locations that meet specific geographic requirements, revealing a notable gap in the current research landscape. To bridge this gap, we propose a data-driven generative DRL framework, designed to proactively generate depots for LRP without predefined depot candidates, based solely on customer request data that include geographic and demand information. It can operate in two distinct modes: direct generation of exact depot locations, and the creation of a multivariate Gaussian distribution for flexible depot sampling. By extracting depots’ geographic pattern from customer request data, our approach can dynamically respond to logistical needs, identifying high-quality depot locations that further reduce total routing costs compared to traditional methods. Extensive experiments demonstrate that, for the same group of customer requests, compared with depots identified through random attempts, our framework can proactively generate depots that lead to superior solution routes with lower routing cost. The implications of our framework potentially extend into real-world applications, particularly in emergency medical rescue and disaster relief logistics, where rapid establishment and adjustment of depot locations are paramount, showcasing its potential in addressing LRP in dynamic and unpredictable environments.

86Distributional Reinforcement Learning Based On Historical Information For Option Hedging

[openreview] [pdf]

Abstract Options are widely used financial derivatives for risk management and corporate operations. Option hedging aims to mitigate investment risks from asset price fluctuations by buying and selling other financial products. Traditional hedging strategies based on the Black-Scholes model face practical limitations due to the assumptions of constant volatility and the neglect of transaction costs. Recently, reinforcement learning (RL) has gained attention in the study of option hedging strategies, but several challenges remain: current methods rely on real-time market data (e.g., underlying asset prices, holdings, remaining option term) to determine optimal positions, underutilizing the potential value of historical data; existing approaches focus on the expected hedging cost, overlooking the comprehensive distribution of costs; and, in terms of training data generation, commonly used single-simulation methods perform well under specific conditions but struggle to ensure the robustness of the model across diverse datasets. To address these issues, we propose a novel distributional RL option hedging method that incorporates historical information. Historical states are included in the state variables, with a gated recurrent unit (GRU) network layer extracting historical information. This is then combined with current information from fully connected layers to inform subsequent network layers, ensuring the agent considers both current and historical market information when learning hedging strategies. The output of the value network is set as a series of quantiles, with the quantile Huber loss function fitting their distribution to evaluate strategies based on the distribution rather than the expected value. To diversify data sources, we use a combination of the Black-Scholes model, the Binomial model, and the Heston model to simulate a large volume of option data. Experimental results show that our method significantly reduces hedging costs and demonstrates strong adaptability and practicality under various market conditions.
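
For reference, the quantile Huber loss mentioned here is the standard quantile-regression objective from distributional RL (Dabney et al.); a minimal PyTorch version, with shapes assumed as annotated, might look like:

```python
import torch
import torch.nn.functional as F

def quantile_huber_loss(pred_quantiles, targets, kappa=1.0):
    """pred_quantiles: (batch, N) quantile estimates of the cost distribution;
    targets: (batch, M) target samples (e.g. from a distributional backup)."""
    n = pred_quantiles.shape[1]
    taus = (torch.arange(n, dtype=pred_quantiles.dtype,
                         device=pred_quantiles.device) + 0.5) / n
    # pairwise TD errors between every target and every quantile estimate
    u = targets.unsqueeze(1) - pred_quantiles.unsqueeze(2)       # (batch, N, M)
    huber = F.huber_loss(pred_quantiles.unsqueeze(2).expand_as(u),
                         targets.unsqueeze(1).expand_as(u),
                         reduction="none", delta=kappa)
    # asymmetric weighting by |tau - 1{u < 0}| turns Huber into quantile loss
    loss = torch.abs(taus.view(1, -1, 1) - (u.detach() < 0).float()) * huber / kappa
    return loss.sum(dim=1).mean()
```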

87Dual-Head Knowledge Distillation: Enhancing Logits Utilization with an Auxiliary Head

[openreview] [pdf]

Abstract Traditional knowledge distillation focuses on aligning the student’s predicted probabilities with both ground-truth labels and the teacher’s predicted probabilities. However, the transition from logits to predicted probabilities obscures certain indispensable information. To address this issue, it is intuitive to additionally introduce a logit-level loss function as a supplement to the widely used probability-level loss function, for exploiting the latent information of logits. Unfortunately, we empirically find that the combination of the newly introduced logit-level loss and the previous probability-level loss leads to performance degradation, even falling behind the performance of employing either loss in isolation. We attribute this phenomenon to the collapse of the classification head, which is verified by our theoretical analysis based on neural collapse theory. Specifically, the gradients of the two loss functions exhibit contradictions in the linear classifier yet display no such conflict within the backbone. Drawing from this theoretical analysis, we propose a novel method called dual-head knowledge distillation, which partitions the linear classifier into two classification heads responsible for different losses, thereby preserving the beneficial effects of both losses on the backbone while eliminating adverse influences on the classification head. Extensive experiments validate that our method can effectively exploit the information inside the logits and achieve superior performance against state-of-the-art counterparts.
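
A minimal sketch of the dual-head idea, under the assumption that one linear head receives the probability-level losses (cross-entropy plus temperature-scaled KL to the teacher) while the other receives a logit-level loss (MSE here, as a placeholder), with both sharing the backbone:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHeadStudent(nn.Module):
    """Shared backbone with two linear heads, so the conflicting gradients
    land on separate classifiers while the backbone gets both signals
    (a sketch of the idea, not the paper's exact recipe)."""
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone
        self.head_prob = nn.Linear(feat_dim, num_classes)   # CE + KD losses
        self.head_logit = nn.Linear(feat_dim, num_classes)  # logit matching

    def forward(self, x):
        f = self.backbone(x)
        return self.head_prob(f), self.head_logit(f)

def dual_head_loss(z_prob, z_logit, z_teacher, y, T=4.0, alpha=1.0, beta=1.0):
    ce = F.cross_entropy(z_prob, y)
    kd = F.kl_div(F.log_softmax(z_prob / T, dim=1),
                  F.softmax(z_teacher / T, dim=1),
                  reduction="batchmean") * T * T
    logit_match = F.mse_loss(z_logit, z_teacher)   # placeholder logit-level loss
    return ce + alpha * kd + beta * logit_match
```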

88Generalizing to any diverse distribution: uniformity, gentle finetuning & rebalancing

[openreview] [pdf]

Abstract As training datasets grow larger, we aspire to develop models that generalize well to any diverse test distribution, even if the latter deviates significantly from the training data. Various approaches like domain adaptation, domain generalization, and robust optimization attempt to address the out-of-distribution challenge by posing assumptions about the relation between the training and test distributions. In contrast, we adopt a more conservative perspective by accounting for the worst-case error across all sufficiently diverse test distributions within a known domain. Our first finding is that training on a uniform distribution over this domain is optimal. We also examine practical remedies when uniform samples are unavailable, considering methods for mitigating non-uniformity through finetuning and rebalancing. Our theory provides a mathematical grounding for previous observations on the role of entropy and rebalancing for o.o.d. generalization and foundation model training. We also provide new empirical evidence across tasks involving o.o.d. shifts which illustrates the broad applicability of our perspective.

89Counterfactual Techniques for Enhancing Customer Retention

[openreview] [pdf]

Abstract In this paper, we introduce a novel counterfactual reasoning method using eBERT embeddings to convert customers from an e-commerce company who frequently add items to their cart but don’t proceed to checkout. We demonstrate that our method i) outperforms existing techniques such as DiCE, GANs, and CFRL in key metrics such as coverage, while also maintaining a low latency; ii) balances high coverage and low latency by adjusting the number of nearest unlike neighbors, highlighting a trade-off between these competing goals; and iii) allows customization of mutable features, improving the practical applicability of our counterfactual explanations.

90Effectively Steer LLM To Follow Preference via Building Confident Directions

[openreview] [pdf]

Abstract Having an LLM that aligns with human preference is essential for accommodating individual needs, such as maintaining writing style or generating specific topics of interest. The majority of current alignment methods rely on fine-tuning or prompting, which can be either costly or difficult to control. Model steering algorithms, which construct steering directions used to modify the model output, are typically easy to implement and optimization-free. However, their capabilities are typically limited to steering the model in one of two directions (i.e., bidirectional steering), and there has been no theoretical understanding guaranteeing their performance. In this work, we propose a theoretical framework to understand and quantify model steering methods. Inspired by this framework, we propose a confident direction steering method (CONFST) that steers LLMs by modifying their activations at inference time. More specifically, CONFST builds a confident direction that is closely aligned with users’ preferences; this direction is then added to the activations of the LLM to effectively steer the model output. Our approach offers three key advantages over popular bidirectional model steering methods: 1) it is more powerful, since multiple (i.e., more than two) users’ preferences can be aligned simultaneously; 2) it is very simple to implement, since there is no need to determine which layer the steering vector should be added to; 3) no explicit user instruction is required. We validate our method on GPT-2 XL (1.5B), Mistral (7B) and Gemma-it (9B) models for tasks that require shifting the output of LLMs across a number of different topics and styles.
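
Generic activation steering, the family CONFST belongs to, can be sketched as follows: a direction is built from activation statistics and added to a layer's hidden states at inference via a forward hook. The direction construction shown here (a mean difference between preference-aligned and generic prompts) is only illustrative; CONFST's confident-direction construction is its own contribution.

```python
import torch

def build_direction(acts_pref, acts_base):
    """Illustrative 'direction': mean activation difference at one layer
    between preference-aligned prompts and generic prompts."""
    d = acts_pref.mean(dim=0) - acts_base.mean(dim=0)
    return d / d.norm()

def add_steering_hook(layer, direction, strength=8.0):
    """Add `strength * direction` to the layer's hidden states at inference."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * direction.to(hidden)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)
```

For a GPT-2-style model the steered layer might be something like `model.transformer.h[20]` (module paths vary by implementation); the returned handle's `.remove()` restores the unsteered model.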

91Learning mirror maps in policy mirror descent

[openreview] [pdf]

Abstract Policy Mirror Descent (PMD) is a popular framework in reinforcement learning, serving as a unifying perspective that encompasses numerous algorithms. These algorithms are derived through the selection of a mirror map and enjoy finite-time convergence guarantees. Despite its popularity, the exploration of PMD’s full potential is limited, with the majority of research focusing on a particular mirror map---namely, the negative entropy---which gives rise to the renowned Natural Policy Gradient (NPG) method. It remains uncertain from existing theoretical studies whether the choice of mirror map significantly influences PMD’s efficacy. In our work, we conduct empirical investigations to show that the conventional mirror map choice (NPG) often yields less-than-optimal outcomes across several standard benchmark environments. Using evolutionary strategies, we identify more efficient mirror maps that enhance the performance of PMD. We first focus on a tabular environment, i.e., Grid-World, where we relate existing theoretical bounds with the performance of PMD for a few standard mirror maps and the learned one. We then show that it is possible to learn a mirror map that outperforms the negative entropy in more complex environments, such as the MinAtar suite. Additionally, we demonstrate that the learned mirror maps generalize effectively to different tasks by testing each map across various other environments.

92Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts

[openreview] [pdf]

Abstract With the rapid progress of diffusion-based content generation, significant efforts are being made to unlearn harmful or copyrighted concepts from pretrained diffusion models (DMs) to prevent potential model misuse. However, it is observed that even when DMs are properly unlearned before release, malicious finetuning can compromise this process, causing DMs to relearn the unlearned concepts. This occurs partly because certain benign concepts (e.g., “skin”) retained in DMs are related to the unlearned ones (e.g., “nudity”), facilitating their relearning via finetuning. To address this, we propose meta-unlearning on DMs. Intuitively, a meta-unlearned DM should behave like an unlearned DM when used as is; moreover, if the meta-unlearned DM undergoes malicious finetuning on unlearned concepts, the related benign concepts retained within it will be triggered to self-destruct, hindering the relearning of unlearned concepts. Our meta-unlearning framework is compatible with most existing unlearning methods, requiring only the addition of an easy-to-implement meta objective. We validate our approach through empirical experiments on meta-unlearning concepts from Stable Diffusion models (SD-v1-4 and SDXL), supported by extensive ablation studies.

93On Statistical Rates of Conditional Diffusion Transformer: Approximation and Estimation

[openreview] [pdf]

Abstract We investigate the approximation and estimation rates of conditional diffusion transformers (DiTs) with classifier-free guidance. We present a comprehensive analysis of “in-context” conditional DiTs under four common data assumptions. We show that both conditional DiTs and their latent variants lead to the minimax optimality of unconditional DiTs under identified settings. Specifically, we discretize the input domains into infinitesimal grids and then perform a term-by-term Taylor expansion on the conditional diffusion score function under the Hölder smooth data assumption. This enables fine-grained use of transformers’ universal approximation through a more detailed piecewise constant approximation, and hence obtains tighter bounds. Additionally, we extend our analysis to the latent setting under the linear latent subspace assumption. We not only show that latent conditional DiTs achieve lower bounds than conditional DiTs both in approximation and estimation, but also show the minimax optimality of latent unconditional DiTs. Our findings establish statistical limits for conditional and unconditional DiTs, and offer practical guidance toward developing more efficient and accurate DiT models.

94Counterfactual Concept Bottleneck Models

[openreview] [pdf]

Abstract Current deep learning models are not designed to simultaneously address three fundamental questions: predict class labels to solve a given classification task (the “What?”), simulate changes in the situation to evaluate how this impacts class predictions (the “How?”), and imagine how the scenario should change to result in different class predictions (the “Why not?”). The inability to answer these questions represents a crucial gap in deploying reliable AI agents, calibrating human trust, and improving human-machine interaction. To bridge this gap, we introduce CounterFactual Concept Bottleneck Models (CF-CBMs), a class of models designed to efficiently address the above queries all at once without the need to run post-hoc searches. Our experimental results demonstrate that CF-CBMs: achieve classification accuracy comparable to black-box models and existing CBMs (“What?”), rely on fewer important concepts leading to simpler explanations (“How?”), and produce interpretable, concept-based counterfactuals (“Why not?”). Additionally, we show that training the counterfactual generator jointly with the CBM leads to two key improvements: (i) it alters the model’s decision-making process, making the model rely on fewer important concepts (leading to simpler explanations), and (ii) it significantly increases the causal effect of concept interventions on class predictions, making the model more responsive to these changes.

95Sampling from Energy-based Policies using Diffusion

[openreview] [pdf]

Abstract Energy-based policies offer a flexible framework for modeling complex, multimodal behaviors in reinforcement learning (RL). In maximum entropy RL, the optimal policy is a Boltzmann distribution derived from the soft Q-function, but direct sampling from this distribution in continuous action spaces is computationally intractable. As a result, existing methods typically use simpler parametric distributions, like Gaussians, for policy representation — limiting their ability to capture the full complexity of multimodal action distributions. In this paper, we introduce a diffusion-based approach for sampling from energy-based policies, where the negative Q-function defines the energy function. Based on this approach, we propose an actor-critic method called Diffusion Q-Sampling (DQS) that enables more expressive policy representations, allowing stable learning in diverse environments. We show that our approach enhances exploration and captures multimodal behavior in continuous control tasks, addressing key limitations of existing methods.

96Adaptive backtracking for fast optimization

[openreview] [pdf]

Abstract Backtracking line search is foundational in numerical optimization. The basic idea is to adjust the step size of an algorithm by a constant factor until some chosen criterion (e.g. Armijo, Goldstein, Descent Lemma) is satisfied. We propose a new way for adjusting step sizes, replacing the constant factor used in regular backtracking with one that takes into account the degree to which the chosen criterion is violated, without additional computational burden. We perform a variety of experiments on over fifteen real world datasets, which confirm that adaptive backtracking often leads to significantly faster optimization. For convex problems, we prove adaptive backtracking requires fewer adjustments to produce a feasible step size than regular backtracking does for two popular line search criteria: the Armijo condition and the descent lemma. For nonconvex smooth problems, we prove adaptive backtracking enjoys the same guarantees of regular backtracking.
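
To make the contrast concrete: regular backtracking always multiplies the step size by a constant rho when the criterion fails, whereas the adaptive variant scales it by an amount tied to the measured violation. The violation-aware factor below is an illustrative choice, not the paper's derived rule.

```python
import numpy as np

def adaptive_backtracking_step(f, grad, x, t0=1.0, c=1e-4, rho=0.5,
                               max_iter=50):
    """One gradient step with an Armijo line search. Instead of always
    shrinking by the constant `rho`, the step is shrunk by a factor tied
    to how badly the Armijo condition is violated."""
    g = grad(x)
    fx, gg = f(x), float(np.dot(g, g))
    t = t0
    for _ in range(max_iter):
        decrease = fx - f(x - t * g)       # actual decrease achieved
        required = c * t * gg              # Armijo requirement
        if decrease >= required:
            return x - t * g, t
        # violation-aware shrink factor, clipped into [0.1, rho]
        factor = min(rho, max(0.1, decrease / required)) if required > 0 else rho
        t *= factor
    return x - t * g, t
```

The larger the gap between the achieved and required decrease, the harder the step size is cut, which is what saves adjustment iterations relative to a fixed factor.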

97On the Byzantine-Resilience of Distillation-Based Federated Learning

[openreview] [pdf]

Abstract Federated Learning (FL) algorithms using Knowledge Distillation (KD) have received increasing attention due to their favorable properties with respect to privacy, non-i.i.d. data and communication cost. These methods depart from transmitting model parameters and instead communicate information about a learning task by sharing predictions on a public dataset. In this work, we study the performance of such approaches in the byzantine setting, where a subset of the clients act in an adversarial manner aiming to disrupt the learning process. We show that KD-based FL algorithms are remarkably resilient and analyze how byzantine clients can influence the learning process. Based on these insights, we introduce two new byzantine attacks and demonstrate their ability to break existing byzantine-resilient methods. Additionally, we propose a novel defence method which enhances the byzantine resilience of KD-based FL algorithms. Finally, we provide a general framework to obfuscate attacks, making them significantly harder to detect, thereby improving their effectiveness. Our findings serve as an important building block in the analysis of byzantine FL, contributing through the development of new attacks and new defence mechanisms, further advancing the robustness of KD-based FL algorithms.

98Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance

[openreview] [pdf]

Abstract State-of-the-art text-to-image (T2I) diffusion models often struggle to generate rare compositions of concepts, e.g., objects with unusual attributes. In this paper, we show that the compositional generation power of diffusion models on such rare concepts can be significantly enhanced by Large Language Model (LLM) guidance. We start with empirical and theoretical analysis, demonstrating that exposing frequent concepts relevant to the target rare concepts during the diffusion sampling process yields more accurate concept composition. Based on this, we propose a training-free approach, R2F, that plans and executes the overall rare-to-frequent concept guidance throughout the diffusion inference by leveraging the abundant semantic knowledge in LLMs. Our framework is flexible across any pre-trained diffusion models and LLMs, and can be seamlessly integrated with region-guided diffusion approaches. In extensive experiments on three datasets, including our newly proposed benchmark RareBench, which contains various prompts with rare compositions of concepts, R2F significantly surpasses existing models, including SD3.0 and FLUX, by up to 28.1%p in T2I alignment.

99Continual Learning After Model Deployment

[openreview] [pdf]

Abstract This paper studies continual learning after model deployment. A real-world application environment is often an open world filled with novel or out-of-distribution (OOD) objects that have not been seen before. We call continual learning in such an environment open-world continual learning (OWCL). OWCL incrementally performs two main tasks: (1) detecting OOD objects, and (2) continually learning the OOD or new objects on the fly. Although OOD detection and continual learning have been extensively studied separately, their combination for OWCL has barely been attempted. This is perhaps because, in addition to the existing challenges of OOD detection and continual learning such as catastrophic forgetting (CF), OWCL also faces the challenge of data scarcity. As novel objects appear sporadically, when an object from a new/novel class is detected, it is difficult to learn it from one or a few samples with good accuracy. This paper proposes a novel method called OpenLD to deal with these problems based on linear discriminant analysis (LDA) and a pre-trained model. This method enables OOD detection and incremental learning of the detected samples on the fly with no CF. Experimental evaluation demonstrates the effectiveness of OpenLD.
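
A minimal sketch of an LDA-style learner in the spirit of OpenLD, assuming frozen pre-trained features: class means are updated incrementally (no gradient training, hence no CF), and a Mahalanobis-distance threshold flags OOD inputs. The names, the fixed covariance, and the threshold rule here are illustrative, not the paper's exact design.

```python
import numpy as np

class StreamingLDA:
    """Class means plus a shared covariance over frozen features; new
    classes are added on the fly and OOD inputs are flagged by distance."""
    def __init__(self, dim, ood_threshold):
        self.means, self.counts = {}, {}
        self.dim, self.tau = dim, ood_threshold
        self.cov = np.eye(dim)   # kept fixed in this sketch

    def distances(self, feat):
        inv = np.linalg.pinv(self.cov)
        return {c: (feat - m) @ inv @ (feat - m)
                for c, m in self.means.items()}

    def predict(self, feat):
        d = self.distances(feat)
        if not d or min(d.values()) > self.tau:
            return "OOD"          # novel object: trigger on-the-fly learning
        return min(d, key=d.get)

    def update(self, feat, label):
        # one-sample incremental mean update; works from a single example
        n = self.counts.get(label, 0)
        mu = self.means.get(label, np.zeros(self.dim))
        self.means[label] = (n * mu + feat) / (n + 1)
        self.counts[label] = n + 1
```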

100Transition Path Sampling with Improved Off-Policy Training of Diffusion Path Samplers

[openreview] [pdf]

Abstract Understanding transition pathways between meta-stable states in molecular systems is crucial to advance material design and drug discovery. However, unbiased molecular dynamics simulations are computationally infeasible due to the high energy barriers separating these states. Although recent machine learning techniques offer potential solutions, they are often limited to simple systems or rely on collective variables (CVs) derived from costly domain expertise. In this paper, we introduce a novel approach that trains diffusion path samplers (DPS) for transition path sampling (TPS) without the need for CVs. We recast the problem as an amortized sampling of the target path measure, minimizing the log-variance divergence between the path measure induced by our DPS and the target path measure. To ensure scalability for high-dimensional tasks, we introduce (1) a new off-policy training objective based on learning control variates with replay buffers and (2) a scale-based equivariant parameterization of the bias forces. We evaluate our approach, coined TPS-DPS, on a synthetic double-well potential and three peptides: Alanine Dipeptide, Polyproline Helix, and Chignolin. Results show that our approach produces more realistic and diverse transition pathways compared to existing baselines. We also provide links toproject pageandcode.

101One Model to Train Them All: A Unified Diffusion Framework for Multi-Context Neural Population Forecasting

[openreview] [pdf]

Abstract Recent research has revealed shared neural patterns among animals performing similar tasks and within individual animals across different tasks. This has led to a growing interest in replacing single-session latent variable models with a unified model that allows us to align recordings across different animals, sessions, and tasks, despite the challenge of distinct neuron identities in each recording. In this work, we present a conditioned diffusion framework to model population dynamics of neural activity across multiple contexts. The quality of the learned dynamics is evaluated through the model’s forecasting ability, which predicts multiple timesteps of both neural activity and behavior. Additionally, we introduce a benchmark dataset spanning six electrophysiology datasets, seven tasks, 19 animals, and 261 sessions, providing a standardized framework for multi-task neural population models. Our results demonstrate that the pretrained model can be efficiently adapted to novel, unseen sessions without requiring explicit neuron correspondence. This enables few-shot learning with minimal labeled data, as well as competitive performance in zero-shot learning.

102Knowledge Lift Alignment Fine Tuning

[openreview] [pdf]

Abstract We present a visual tuning framework, Knowledge Lift Alignment Fine Tuning (KLAFT), which enhances the expressive image captioning capabilities of Pre-trained Language Models (PLMs), including LLMs and VLMs. As this task involves generating more detailed and comprehensive captions than basic image descriptions, the core idea behind KLAFT is that fine-grained alignment can exploit the capabilities of PLMs and a given target domain dataset. This idea motivates and challenges us to explore a framework that deeply understands both given images and text for this alignment, tuning PLMs towards expressive image captioning. To this end, KLAFT modifies the attention mechanism (Modified Attention Mechanism, MAM) and develops a Topic Control Mechanism (TCM), along with their training objectives. The innovation of KLAFT lies in its approach to addressing disparities in knowledge: visual versus textual, via MAM, and source versus target domain, via TCM. As these hidden spaces are conceptualized as distinct sub-networks within the PLM, each possessing specific knowledge, KLAFT’s unique contribution lies in aligning and adjusting the weights of these sub-networks in a fine-grained manner, and fine-tuning the PLM accordingly. Our empirical studies demonstrate that KLAFT significantly improves expressive captioning tasks by aligning and amplifying target knowledge, with the potential for Parameter-Efficient Fine-Tuning (PEFT) at low computational cost.

103When do GFlowNets learn the right distribution?

[openreview] [pdf]

Abstract Generative Flow Networks (GFlowNets) are an emerging class of sampling methods for distributions over discrete and compositional objects, e.g., graphs. In spite of their remarkable success in problems such as drug discovery and phylogenetic inference, the question of when and whether GFlowNets learn to sample from the target distribution remains underexplored. To tackle this issue, we first assess the extent to which a violation of the detailed balance of the underlying flow network might hamper the correctness of GFlowNet’s sampling distribution. In particular, we demonstrate that the impact of an imbalanced edge on the model’s accuracy is influenced by the total amount of flow passing through it and, as a consequence, is unevenly distributed across the network. We also argue that, depending on the parameterization, imbalance may be inevitable. In this regard, we consider the problem of sampling from distributions over graphs with GFlowNets parameterized by graph neural networks (GNNs) and show that the representation limits of GNNs delineate which distributions these GFlowNets can approximate. Lastly, we address these limitations by proposing a theoretically sound and computationally tractable metric for assessing GFlowNets, experimentally showing it is a better proxy for correctness than popular evaluation protocols.

104Efficient Fairness-Performance Pareto Front Computation

[openreview] [pdf]

Abstract There is a well known intrinsic trade-off between the fairness of a representation and the performance of classifiers derived from the representation. Due to the complexity of optimisation algorithms in most modern representation learning approaches, for a given method it may be non-trivial to decide whether the obtained fairness-performance curve of the method is optimal, i.e., whether it is close to the true Pareto front for these quantities for the underlying data distribution. In this paper we propose a new method to compute the optimal Pareto front, which does not require the training of complex representation models. We show that optimal fair representations possess several useful structural properties, and that these properties enable a reduction of the computation of the Pareto front to a compact discrete problem. We then also show that these compact approximating problems can be efficiently solved via off-the-shelf concave-convex programming methods. Finally, in addition to representations, we show that the new methods may also be used to directly compute the Pareto front of fair classification problems. Since our approach is independent of the specific model of representations, it may be used as the benchmark to which representation learning algorithms, or classifiers, may be compared. We experimentally evaluate the approach on a number of real world benchmark datasets.

105Combating Dual Noise Effect in Spatial-temporal Forecasting via Information Bottleneck Principle

[openreview] [pdf]

Abstract Spatial-temporal forecasting plays a pivotal role in urban planning and computing. Although Spatial-Temporal Graph Neural Networks (STGNNs) excel in modeling spatial-temporal dynamics, they often suffer from relatively poor computational efficiency. Recently, Multi-Layer Perceptrons (MLPs) have gained popularity in spatial-temporal forecasting for their simplified architecture and better efficiency. However, existing MLP-based models can be susceptible to noise interference, especially when the noise can affect both input and target sequences in spatial-temporal forecasting on noisy data. To alleviate this impact, we propose the Robust Spatial-Temporal Information Bottleneck (RSTIB) principle. The RSTIB extends previous Information Bottleneck (IB) approaches by lifting the specific Markov assumption without impairing the IB nature. Then, by explicitly minimizing the irrelevant noisy information, the representation learning guided by RSTIB can be more robust against noise interference. Furthermore, the instantiation, RSTIB-MLP, can be seamlessly implemented with MLPs, thereby achieving efficient and robust spatial-temporal modeling. Moreover, a training regime is designed to handle the dynamic nature of spatial-temporal relationships by incorporating a knowledge distillation module to alleviate feature collapse and enhance model robustness under noisy conditions. Our extensive experimental results on six intrinsically noisy benchmark datasets from various domains show that the RSTIB-MLP runs much faster than state-of-the-art STGNNs and delivers superior forecasting accuracy across noisy environments, substantiating its robustness and efficiency.

106Leveraging Knowledge Distillation to Mitigate Model Collapse

[openreview] [pdf]

Abstract Since the amount of data generated by neural networks on the Internet is growing rapidly due to widespread access to the corresponding models, it is natural to ask how this surge in synthetic data affects the training of subsequent models that will use it. Previous work has demonstrated a concerning trend: models trained predominantly on synthetic data often experience a decline in performance, which can escalate to a complete loss of the ability to reproduce the initial distribution of real-world data. This phenomenon, now referred to as model collapse, highlights the potential pitfalls of over-reliance on synthetic datasets, which may lack the diversity and complexity inherent in genuine data. To address this issue, we propose a novel method that leverages the well-established technique of knowledge distillation. Our approach aims to mitigate the adverse effects of synthetic data by facilitating a more effective transfer of knowledge from high-performing teacher models to a student model. By doing so, we seek to enhance not only the qualitative aspects—such as the richness and variability of the generated outputs—but also the quantitative metrics that gauge model performance. Through extensive experimentation, we demonstrate that our method improves the robustness and generalization capabilities of models trained on synthetic data; for instance, the enhancement for DDPM is 68.8% in terms of the FID metric, contributing to a more sustainable and effective use of synthetic datasets in machine learning applications.

107Robust Root Cause Diagnosis using In-Distribution Interventions

[openreview] [pdf]

Abstract Diagnosing the root cause of an anomaly in a complex interconnected system is a pressing problem in today’s cloud services and industrial operations. Effective root cause diagnosis calls for identifying nodes whose disrupted local mechanisms cause anomalous behavior at a target node. We propose IDI, a novel algorithm that predicts root cause as nodes that meet two criteria: 1) Anomaly: root cause nodes should take on anomalous values; 2) Fix: had the root cause nodes assumed usual values, the target node would not have been anomalous. Prior methods of assessing the fix condition rely on counterfactuals inferred from a Structural Causal Model (SCM) trained on historical data. But since anomalies are rare and fall outside the training distribution, the fitted SCMs yield unreliable counterfactual estimates. IDI overcomes this by relying on interventional estimates obtained by solely probing the fitted SCM at in-distribution inputs. Our theoretical analysis demonstrates that IDI’s in-distribution intervention approach outperforms other counterfactual estimation methods under mild assumptions about the data-generating process. Experiments on both synthetic and Petshop RCD benchmark datasets demonstrate that IDI consistently identifies true root causes more accurately and robustly than nine existing state-of-the-art RCD baselines. We release the anonymized code at https://anonymous.4open.science/r/petshop-BB8A/.
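
The two criteria can be sketched against a fitted SCM as follows; `scm_predict`, `anomaly_score`, `typical`, and `node_is_anomalous` are hypothetical helpers standing in for the fitted model's forward pass, the target's anomaly scorer, in-distribution reference values, and per-node anomaly flags.

```python
def is_root_cause(scm_predict, anomaly_score, observed, candidates,
                  typical, node_is_anomalous, tau):
    """Sketch of IDI's two checks on a fitted SCM (helper names are
    hypothetical). `scm_predict(values)` propagates node values through
    the SCM in topological order and returns the target node's value."""
    # 1) Anomaly: every candidate root-cause node must itself be anomalous.
    if not all(node_is_anomalous[n] for n in candidates):
        return False
    # 2) Fix: with candidates forced to typical (in-distribution) values,
    #    the SCM-predicted target should no longer be anomalous.
    intervened = dict(observed)
    for n in candidates:
        intervened[n] = typical[n]     # an in-distribution intervention
    return anomaly_score(scm_predict(intervened)) <= tau
```

Because the intervention replaces anomalous inputs with typical ones, the SCM is only ever queried inside its training distribution, which is the point of the method.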

108The Inductive Bias of Minimum-Norm Shallow Diffusion Models That Perfectly Fit the Data

[openreview] [pdf]

Abstract While diffusion models can generate high-quality images through the probability flow process, the theoretical understanding of this process is incomplete. A key open question is determining when the probability flow converges to the training samples used for denoiser training and when it converges to more general points on the data manifold. To address this, we analyze the probability flow of shallow ReLU neural network denoisers which interpolate the training data and have a minimal ℓ2 norm of the weights. For intuition, we also examine a simpler dynamics which we call the score flow, and demonstrate that, in the case of orthogonal datasets, the score flow and probability flow follow similar trajectories. Both flows converge to a training point or a sum of training points. However, due to early stopping induced by the scheduler, the probability flow can also converge to a general point on the data manifold. This result aligns with empirical observations that diffusion models tend to memorize individual training examples and reproduce them during testing. Moreover, diffusion models can combine memorized foreground and background objects, indicating they can learn a “semantic sum” of training points. We generalize these results from the orthogonal dataset case to scenarios where the clean data points lie on an obtuse simplex. Simulations further confirm that the probability flow converges to one of the following: a training point, a sum of training points, or a point on the data manifold.

109Mitigating Distribution Shifts: Uncertainty-Aware Offline-to-Online Reinforcement Learning

[openreview] [pdf]

Abstract Deploying reinforcement learning (RL) policies in real-world scenarios faces challenges due to distribution shifts from training environments. Past approaches have shown limitations such as poor generalization to out-of-distribution (OOD) variations or requiring extensive retraining on new data. We propose Uncertainty-aware Adaptive RL, UARL, a novel RL pipeline that enhances policy generalization across diverse variations of a given environment. UARL frames distribution shifts as OOD problems and incorporates a new OOD detection method to quantify uncertainty. This approach enables iterative policy fine-tuning, starting with offline training on a limited state space and progressively expanding to more diverse variations of the same environment through online interactions. We demonstrate the effectiveness and robustness of UARL through extensive experiments on continuous control tasks, showing improved performance and sample efficiency as well as reliability in OOD detection compared to existing methods.

110Diffusion Bridge Implicit Models

[openreview] [pdf]

Abstract Denoising diffusion bridge models (DDBMs) are a powerful variant of diffusion models for interpolating between two arbitrary paired distributions given as endpoints. Despite their promising performance in tasks like image translation, DDBMs require a computationally intensive sampling process that involves the simulation of a (stochastic) differential equation through hundreds of network evaluations. In this work, we take the first step in fast sampling of DDBMs without extra training, motivated by the well-established recipes in diffusion models. We generalize DDBMs via a class of non-Markovian diffusion bridges defined on the discretized timesteps concerning sampling, which share the same marginal distributions and training objectives, and give rise to generative processes ranging from stochastic to deterministic, resulting in diffusion bridge implicit models (DBIMs). DBIMs are not only up to 25× faster than the vanilla sampler of DDBMs but also induce a novel, simple, and insightful form of ordinary differential equation (ODE) which inspires high-order numerical solvers. Moreover, DBIMs maintain the generation diversity in a distinguished way, by using a booting noise in the initial sampling step, which enables faithful encoding, reconstruction, and semantic interpolation in image translation tasks.

111Scaling Diffusion Models for Downstream Prediction

[openreview] [pdf]

Abstract In this paper, we argue that iterative computation, as exemplified by diffusion models, offers a powerful paradigm not only for image generation but also for visual perception tasks. First, we unify several mid-level vision tasks, ranging from depth estimation to optical flow to segmentation, as image-to-image translation tasks. Then, through extensive experiments across these tasks, we demonstrate how diffusion models scale with increased compute during both training and inference. Notably, we train various dense and Mixture-of-Experts models with up to 2.8 billion parameters, and we increase compute at test time through more sampling steps and various ensembling methods. Our work provides compelling evidence for the benefits of scaling compute at train and test time for diffusion models for visual perception, and by studying the scaling properties carefully, we are able to achieve the same performance as state-of-the-art models with less compute.

112On the onset of memorization to generalization transition in diffusion models

[openreview] [pdf]

Abstract As the training set size increases, diffusion models have been observed to transition from memorizing the training dataset to generalizing to and sampling from the underlying data distribution. To study this phenomenon more closely, we first present a mathematically principled definition of this transition: the model is said to be in the generalization regime if the generated distribution is closer to the sampling distribution than to the probability distribution associated with a Gaussian kernel approximation to the training dataset. Then, we develop an analytically tractable diffusion model that features this transition when the training data is sampled from an isotropic Gaussian distribution. Our study reveals that this transition occurs when the distance between the generated and underlying sampling distribution begins to decrease rapidly with the addition of more training samples. This is to be contrasted with an alternative scenario, where the model’s memorization performance degrades but its generalization performance does not improve. We also provide empirical evidence indicating that realistic diffusion models exhibit the same alignment of scales.

113Pareto Prompt Optimization

[openreview] [pdf]

Abstract Natural language prompt optimization, or prompt engineering, has emerged as a powerful technique to unlock the potential of Large Language Models (LLMs) for various tasks. While existing methods primarily focus on maximizing a single task-specific performance metric for LLM outputs, real-world applications often require considering trade-offs between multiple objectives. In this work, we address this limitation by proposing an effective technique for multi-objective prompt optimization for LLMs. Specifically, we propose ParetoPrompt, a reinforcement learning (RL) method that leverages dominance relationships between prompts to derive a policy model for prompt optimization using preference-based loss functions. By leveraging multi-objective dominance relationships, ParetoPrompt enables efficient exploration of the entire Pareto front without the need for a predefined scalarization of multiple objectives. Our experimental results show that ParetoPrompt consistently outperforms existing algorithms that use specific objective values. ParetoPrompt also yields robust performance when the objective metrics differ between training and testing.

114Decouple-Then-Merge: Towards Better Training for Diffusion Models

[openreview] [pdf]

Abstract Diffusion models are trained by learning a sequence of models that reverse each step of noise corruption. Typically, the model parameters are fully shared across multiple timesteps to enhance training efficiency. However, since the denoising tasks differ at each timestep, the gradients computed at different timesteps may conflict, potentially degrading the overall performance of image generation. To solve this issue, this work proposes a Decouple-then-Merge (DeMe) framework, which begins with a pretrained model and finetunes separate models tailored to specific timesteps. We introduce several improved techniques during the finetuning stage to promote effective knowledge sharing while minimizing training interference across timesteps. Finally, after finetuning, these separate models can be merged into a single model in parameter space, ensuring efficient and practical inference. Experimental results show significant generation quality improvements on 6 benchmarks, including Stable Diffusion on COCO30K, ImageNet1K, and PartiPrompts, and DDPM on LSUN Church, LSUN Bedroom, and CIFAR10. Code is included in the supplementary material and will be released on GitHub.
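
The merging step itself amounts to parameter-space averaging of the timestep-specialized checkpoints; a sketch is below (the paper's contribution lies mainly in the finetuning techniques that make such an average work well).

```python
import torch

def merge_timestep_models(state_dicts, weights=None):
    """Average the timestep-specialized finetunes back into one model.
    (Integer buffers, e.g. BatchNorm counters, would need special-casing
    in a real implementation.)"""
    weights = weights or [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float()
                          for w, sd in zip(weights, state_dicts))
    return merged
```

The merged dictionary can then be loaded with `model.load_state_dict(merged)`, so inference cost is identical to the original single model.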

115Tra-MoE: Scaling Trajectory Prediction Models for Adaptive Policy Conditioning

[openreview] [pdf]

Abstract Scale is a primary factor that influences the performance and generalization of a robot learning system. In this paper, we aim to scale up the trajectory prediction model by using broad out-of-domain data to improve its robustness and generalization ability. The trajectory model is designed to predict any-point trajectories in the current frame given an instruction, and can provide detailed control guidance for robotic policy learning. To handle the diverse out-of-domain data distribution, we propose a sparsely-gated MoE (Top-1 gating strategy) architecture for the trajectory model, coined Tra-MoE. The sparse activation design enables a good balance between parameter cooperation and specialization, effectively benefiting from large-scale out-of-domain data while maintaining constant FLOPs per token. In addition, we further introduce an adaptive policy conditioning technique by learning 2D mask representations for predicted trajectories, which are explicitly aligned with image observations to guide policy prediction more flexibly. We perform experiments in both simulation and real-world scenarios to verify the effectiveness of our Tra-MoE and the adaptive policy conditioning technique. We jointly train the Tra-MoE model on all 130 tasks in the LIBERO benchmark and conduct a comprehensive empirical analysis, demonstrating that Tra-MoE consistently exhibits superior performance compared to the dense baseline model, even when the latter is scaled to match Tra-MoE’s parameter count.

116AutoLoRA: AutoGuidance Meets Low-Rank Adaptation for Diffusion Models

[openreview] [pdf]

Abstract Low-rank adaptation (LoRA) is a fine-tuning technique that can be applied to conditional generative diffusion models. LoRA utilizes a small number of context examples to adapt the model to a specific domain, character, style, or concept. However, due to the limited data utilized during training, the fine-tuned model performance is often characterized by strong context bias and a low degree of variability in the generated images. To solve this issue, we introduce AutoLoRA, a novel guidance technique for diffusion models fine-tuned with the LoRA approach. Inspired by other guidance techniques, AutoLoRA searches for a trade-off between consistency in the domain represented by LoRA weights and sample diversity from the base conditional diffusion model. Moreover, we show that incorporating classifier-free guidance for both LoRA fine-tuned and base models leads to generating samples with higher diversity and better quality. The experimental results for several fine-tuned LoRA domains show superiority over existing guidance techniques on selected metrics.

117Amortized Posterior Sampling with Diffusion Prior Distillation

[openreview] [pdf]

Abstract We propose Amortized Posterior Sampling (APS), a novel variational inference approach for efficient posterior sampling in inverse problems. Our method trains a conditional flow model to minimize the divergence between the variational distribution and the posterior distribution implicitly defined by the diffusion model. This results in a powerful, amortized sampler capable of generating diverse posterior samples with a single neural function evaluation, generalizing across various measurements. Unlike existing methods, our approach is unsupervised, requires no paired training data, and is applicable to both Euclidean and non-Euclidean domains. We demonstrate its effectiveness on a range of tasks, including image restoration, manifold signal reconstruction, and climate data imputation. APS significantly outperforms existing approaches in computational efficiency while maintaining competitive reconstruction quality, enabling real-time, high-quality solutions to inverse problems across diverse domains.

118GUIDE: Guidance-based Incremental Learning with Diffusion Models

[openreview] [pdf]

Abstract We introduce GUIDE, a novel continual learning approach that directs diffusion models to rehearse samples at risk of being forgotten. Existing generative strategies combat catastrophic forgetting by randomly sampling rehearsal examples from a generative model. Such an approach contradicts buffer-based approaches where sampling strategy plays an important role. We propose to bridge this gap by incorporating classifier guidance into the diffusion process to produce rehearsal examples specifically targeting information forgotten by a continuously trained model. This approach enables the generation of samples from preceding task distributions, which are more likely to be misclassified in the context of recently encountered classes. Our experimental results show that GUIDE significantly reduces catastrophic forgetting, outperforming conventional random sampling approaches and surpassing recent state-of-the-art methods in continual learning with generative replay.

119Learning-Augmented Frequent Directions

[openreview] [pdf]

Abstract An influential paper of Hsu et al. (ICLR’19) introduced the study of learning-augmented streaming algorithms in the context of frequency estimation. A fundamental problem in the streaming literature, the goal of frequency estimation is to approximate the number of occurrences of items appearing in a long stream of data using only a small amount of memory. Hsu et al. develop a natural framework to combine the worst-case guarantees of popular solutions such as CountMin and CountSketch with learned predictions of high frequency elements. They demonstrate that learning the underlying structure of data can be used to yield better streaming algorithms, both in theory and practice. We simplify and generalize past work on learning-augmented frequency estimation. Our first contribution is a learning-augmented variant of the Misra-Gries algorithm which improves upon the error of learned CountMin and learned CountSketch and achieves the state-of-the-art performance of randomized algorithms (Aamand et al., NeurIPS’23) with a simpler, deterministic algorithm. Our second contribution is to adapt learning-augmentation to a high-dimensional generalization of frequency estimation corresponding to finding important directions (top singular vectors) of a matrix given its rows one-by-one in a stream. We analyze a learning-augmented variant of the Frequent Directions algorithm, extending the theoretical and empirical understanding of learned predictions to matrix streaming.
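
For intuition, one simple way to augment Misra-Gries with predictions is to give oracle-predicted heavy hitters dedicated exact counters and run the classic algorithm on the remaining items. The sketch below conveys this flavor only; the paper's actual algorithm and its error guarantees are more refined.

```python
from collections import Counter

def learned_misra_gries(stream, predicted_heavy, k):
    """Illustrative learning-augmented variant: predicted heavy hitters
    get exact counters, everything else goes through standard Misra-Gries
    with k counters."""
    exact = Counter()    # exact counts for oracle-predicted heavy hitters
    mg = {}              # Misra-Gries summary for the remaining items
    for item in stream:
        if item in predicted_heavy:
            exact[item] += 1
        elif item in mg:
            mg[item] += 1
        elif len(mg) < k:
            mg[item] = 1
        else:            # classic decrement-all step of Misra-Gries
            for key in list(mg):
                mg[key] -= 1
                if mg[key] == 0:
                    del mg[key]
    return exact, mg
```

A good oracle removes the dominant items from the Misra-Gries table, which is what shrinks the estimation error on everything else.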

120GUARANTEED USER FAIRNESS IN RECOMMENDATION

[openreview] [pdf]

Abstract Although recommender systems (RS) have been well-developed for various fields of applications, they suffer from the crisis of platform credibility with respect to RS confidence and fairness, which may drive users away from the platform and undermine the platform’s long-term success. In recent years, a few works have tried to solve either the model confidence or fairness issue, while there is no statistical guarantee for these methods. There is therefore an urgent need to solve both issues with a unifying framework with statistical guarantees. In this paper, we propose a novel and reliable framework called Guaranteed User Fairness in Recommendation (GUFR) to dynamically generate prediction sets for users across various groups, which are guaranteed 1) to include the ground-truth items with user-predefined high confidence/probability (e.g., 90%); 2) to ensure user fairness across different groups; 3) to have the minimum average set size. We further design an efficient algorithm named Guaranteed User Fairness Algorithm (GUFA) to optimize the proposed method, and upper bounds of the risk and fairness metric are derived to help speed up the optimization process. Moreover, we provide rigorous theoretical analysis with respect to risk and fairness control as well as the minimum set size. Extensive experiments also validate the effectiveness of the proposed framework, which aligns with our theoretical analysis. The code is publicly available at https://anonymous.4open.science/r/GUFR-76EC.

121SafeDiffuser: Safe Planning with Diffusion Probabilistic Models

[openreview] [pdf]

Abstract Diffusion models have shown promise in data-driven planning. While these planners are commonly employed in applications where decisions are critical, they still lack established safety guarantees. In this paper, we address this limitation by introducing SafeDiffuser, a method to equip diffusion models with safety guarantees via control barrier functions. The key idea of our approach is to embed finite-time diffusion invariance, i.e., a form of specification consisting of safety constraints, into the denoising diffusion procedure. This way we enable data generation under safety constraints. We show that SafeDiffusers maintain the generative performance of diffusion models while also providing robustness in safe data generation. We evaluate our method on a series of tasks, including maze path generation, legged robot locomotion, and 3D space manipulation, and demonstrate the advantages of robustness over vanilla diffusion models.

122Derivative-Free Guidance in Continuous and Discrete Diffusion Models with Soft Value-Based Decoding

[openreview] [pdf]

Abstract Diffusion models excel at capturing the natural design spaces of images, molecules, DNA, RNA, and protein sequences. However, rather than merely generating designs that are natural, we often aim to optimize downstream reward functions while preserving the naturalness of these design spaces. Existing methods for achieving this goal often require differentiable proxy models (e.g., classifier guidance or DPS) or involve computationally expensive fine-tuning of diffusion models (e.g., classifier-free guidance, RL-based fine-tuning). In our work, we propose a new method to address these challenges. Our algorithm is an iterative sampling method that integrates soft value functions, which look ahead to how intermediate noisy states lead to high rewards in the future, into the standard inference procedure of pre-trained diffusion models. Notably, our approach avoids fine-tuning generative models and eliminates the need to construct differentiable models. This enables us to (1) directly utilize non-differentiable features/reward feedback, commonly used in many scientific domains, and (2) apply our method to recent discrete diffusion models in a principled way. Finally, we demonstrate the effectiveness of our algorithm across several domains, including image generation, molecule generation, and DNA/RNA sequence generation.
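
A minimal sketch of one such value-guided denoising step, under stated assumptions: `proposal` and `value` are hypothetical stand-ins for the pre-trained sampler's transition and the soft value estimate, and candidates are resampled with softmax weights, so no reward gradients are required.

```python
import numpy as np

def value_guided_step(x_t, proposal, value, n_candidates=8, temperature=0.1,
                      rng=None):
    """One derivative-free, value-guided denoising step (illustrative).

    proposal(x_t) -> a candidate next state from the pre-trained model;
    value(x)      -> scalar look-ahead estimate of the downstream reward.
    """
    rng = rng or np.random.default_rng()
    candidates = [proposal(x_t) for _ in range(n_candidates)]
    v = np.array([value(c) for c in candidates])
    w = np.exp((v - v.max()) / temperature)  # softmax weights, shifted for stability
    w /= w.sum()
    return candidates[rng.choice(n_candidates, p=w)]
```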

123Prompt-Agnostic Erasure for Diffusion Models Using Task Vectors

[openreview] [pdf]

Abstract With the rapid growth of text-to-image models, a variety of techniques have been suggested to prevent undesirable image generations. Yet, these methods often only protect against specific user prompts and have been shown to allow undesirable generations with other inputs. Here we focus on unconditionally erasing a concept from a text-to-image model rather than conditioning the erasure on the user’s prompt. We first show that compared to input-dependent erasure methods, concept erasure that uses Task Vectors (TV) is more robust to unexpected user inputs, not seen during training. However, TV-based erasure can also affect the core performance of the edited model, particularly when the required edit strength is unknown. To this end, we propose a method called Diverse Inversion, which we use to estimate the required strength of the TV edit. Diverse Inversion finds within the model input space a large set of word embeddings, each of which induces the generation of the target concept. We find that encouraging diversity in the set makes our estimation more robust to unexpected prompts. Finally, we show that Diverse Inversion enables us to apply a TV edit only to a subset of the model weights, enhancing the erasure capabilities while better maintaining model utility.

124SePPO: Semi-Policy Preference Optimization for Diffusion Alignment

[openreview] [pdf]

Abstract Reinforcement learning from human feedback (RLHF) methods are emerging as a way to fine-tune diffusion models (DMs) for visual generation. However, commonly used on-policy strategies are limited by the generalization capability of the reward model, while off-policy approaches require large amounts of difficult-to-obtain paired human-annotated data, particularly in visual generation tasks. To address the limitations of both on- and off-policy RLHF, we propose a preference optimization method that aligns DMs with preferences without relying on reward models or paired human-annotated data. Specifically, we introduce a Semi-Policy Preference Optimization (SePPO) method. SePPO leverages previous checkpoints as reference models while using them to generate on-policy reference samples, which replace “losing images” in preference pairs. This approach allows us to optimize using only off-policy “winning images”. Furthermore, we design a strategy for reference model selection that expands the exploration in the policy space. Notably, we do not simply treat reference samples as negative examples for learning. Instead, we design an anchor-based criterion to assess whether the reference samples are likely to be winning or losing images, allowing the model to selectively learn from the generated reference samples. This approach mitigates performance degradation caused by the uncertainty in reference sample quality. We validate SePPO across both text-to-image and text-to-video benchmarks. SePPO surpasses all previous approaches on the text-to-image benchmarks and also demonstrates outstanding performance on the text-to-video benchmarks.

125Optimizing Backward Policies in GFlowNets via Trajectory Likelihood Maximization

[openreview] [pdf]

Abstract Generative Flow Networks (GFlowNets) are a family of generative models that learn to sample objects with probabilities proportional to a given reward function. The key concept behind GFlowNets is the use of two stochastic policies: a forward policy, which incrementally constructs compositional objects, and a backward policy, which sequentially deconstructs them. Recent results show a close relationship between GFlowNet training and entropy-regularized reinforcement learning (RL) problems with a particular reward design. However, this connection applies only in the setting of a fixed backward policy, which might be a significant limitation. As a remedy to this problem, we introduce a simple backward policy optimization algorithm that involves direct maximization of the value function in an entropy-regularized Markov Decision Process (MDP) over intermediate rewards. We provide an extensive experimental evaluation of the proposed approach across various benchmarks in combination with both RL and GFlowNet algorithms and demonstrate its faster convergence and mode discovery in complex environments.

126DEALING WITH OUT OF DISTRIBUTION IN PREDICTION PROBLEM

[openreview] [pdf]

Abstract The open-world assumption in model development means that a model may not have enough information to effectively handle data that is completely different or out of distribution (OOD). When a model encounters OOD data, it may suffer a significant decrease in performance. Addressing OOD data requires extensive fine-tuning and experimental trials, which in turn require substantial computational resources. Deep learning has been suggested as a solution and has shown significant improvements, but it often requires high-specification hardware, particularly GPUs, which may not always be readily available to general users. Additionally, there is a lack of clear guidance for common users on how to select and evaluate OOD data. This study delves into detection, evaluation, and prediction tasks within the context of OOD on tabular datasets. It demonstrates how common users can identify OOD data from real datasets and provides guidance on evaluating the OOD selection through experiments and visualizations. Furthermore, the study introduces tabular contrast learning (TCL), an enhanced technique specifically designed for tabular prediction tasks. TCL is more efficient compared to other baseline models, making it useful for general machine learning users with computational limitations when dealing with OOD problems. The study also includes a comprehensive comparison with existing approaches, focusing on both accuracy and computational efficiency.

127Statistical Test on Diffusion Model-based Anomaly Detection by Selective Inference

[openreview] [pdf]

Abstract Advancements in AI image generation, particularly diffusion models, have progressed rapidly. However, the absence of an established framework for quantifying the reliability of AI-generated images hinders their use in critical decision-making tasks, such as medical image diagnosis. In this study, we address the task of detecting anomalous regions in medical images using diffusion models and propose a statistical method to quantify the reliability of the detected anomalies. The core concept of our method involves a selective inference framework, wherein statistical tests are conducted under the condition that the images are produced by a diffusion model. With our approach, the statistical significance of anomaly detection results can be quantified in the form of a p-value, enabling decision-making with controlled error rates, as is standard in medical practice. We demonstrate the theoretical soundness and practical effectiveness of our statistical test through numerical experiments on both synthetic and brain image datasets.

128Episodic Novelty Through Temporal Distance

[openreview] [pdf]

Abstract Exploration in sparse reward environments remains a significant challenge in reinforcement learning, particularly in Contextual Markov Decision Processes (CMDPs), where environments differ across episodes. Existing episodic intrinsic motivation methods for CMDPs primarily rely on count-based approaches, which are ineffective in large state spaces, or on similarity-based methods that lack appropriate metrics for state comparison. To address these shortcomings, we propose Episodic Novelty Through Temporal Distance (ETD), a novel approach that introduces temporal distance as a robust metric for state similarity and intrinsic reward computation. By employing contrastive learning, ETD accurately estimates temporal distances and derives intrinsic rewards based on the novelty of states within the current episode. Extensive experiments on various benchmark tasks demonstrate that ETD significantly outperforms state-of-the-art methods, highlighting its effectiveness in enhancing exploration in sparse reward CMDPs.

129On Inductive Biases That Enable Generalization in Diffusion Transformers

[openreview] [pdf]

Abstract Recent work studying the generalization of diffusion models with UNet-based denoisers reveals inductive biases that can be expressed via geometry-adaptive harmonic bases. However, in practice, more recent denoising networks are often based on transformers, e.g., the diffusion transformer (DiT). This raises the question: do transformer-based denoising networks exhibit inductive biases that can also be expressed via geometry-adaptive harmonic bases? To our surprise, we find that this is not the case. This discrepancy motivates our search for the inductive bias that can lead to good generalization in DiT models. Investigating a DiT’s pivotal attention modules, we find that the locality of attention maps is closely associated with generalization. To verify this finding, we modify the generalization of a DiT by restricting its attention windows. We inject local attention windows into a DiT and observe an improvement in generalization. Furthermore, we empirically find that both the placement and the effective attention size of these local attention windows are crucial factors. Experimental results on the CelebA, ImageNet, and LSUN datasets show that strengthening the inductive bias of a DiT can improve both generalization and generation quality when less training data is available. Source code will be released publicly upon paper publication.

130CURIOSITY IS THE PATH TO OPTIMIZATION

[openreview] [pdf]

Abstract In PAC theory, it is posited that larger hypothesis spaces necessitate more independently and identically distributed (i.i.d) data to maintain the accuracy of model performance. PAC-MDP theory defines curiosity by assigning higher rewards for visiting states that are far from the previously visited trajectory, which supports more independent and i.i.d data collection. Recently, this field has witnessed attempts to narrow the hypothesis space by developing additional mechanisms that train multiple skills and facilitate the sharing of information among them, thereby discovering commonalities. However, one might wonder: What if curiosity could not only enhance the efficiency of data collection but also significantly reduce the hypothesis space, thereby driving optimal outcomes independently without the additional mechanisms used in PAC-MDP? Significant discussion has been devoted to the reduction of hypothesis spaces and the utilization of curiosity. Within this context, contrastive multi-skill reinforcement learning (RL) exhibits both traits. Previous research in contrastive multi-skill RL has utilized this technique primarily as a form of pretraining. However, there has been scant investigation into whether the technique itself can reduce the hypothesis space to optimize the outcomes. We have mathematically proven that curiosity provides bounds to guarantee optimality in contrastive multi-skill RL. Additionally, we have leveraged these findings to develop an algorithm that is applicable in real-world scenarios, which has been demonstrated to surpass other prominent algorithms. Furthermore, our experiments have shown that different skills are actually reducing the hypothesis space of the policy by being hierarchically grouped.

131Latent Abstractions in Generative Diffusion Models

[openreview] [pdf]

Abstract In this work, we study how diffusion-based generative models produce high-dimensional data, such as an image, by implicitly relying on a manifestation of a low-dimensional set of latent abstractions, that guide the generative process. We present a novel theoretical framework that extends Nonlinear Filtering (NLF), and that offers a unique perspective on SDE-based generative models. The development of our theory relies on NLF, including a novel formulation of the joint (state and measurement) dynamics, and an information-theoretic measure of the influence of the system state on the measurement process. According to our theory, diffusion models can be cast as a system of SDEs, describing a non-linear filter in which the evolution of unobservable latent abstractions steers the dynamics of an observable measurement process (corresponding to the generative pathways). In addition, we present an empirical study to validate our theory and previous empirical results on the emergence of latent abstractions at different stages of the generative process.

132Fast Diversity-Preserving Reward Finetuning of Diffusion Models via Nabla-GFlowNets

[openreview] [pdf]

Abstract While one commonly trains large diffusion models by collecting datasets on target downstream tasks, it is often desired to finetune pretrained diffusion models on some reward functions that are either designed by experts or learned from small-scale datasets. Existing methods for finetuning diffusion models typically suffer either 1) lack of diversity in generated samples, or 2) costly finetuning and slow convergence. Inspired by recent successes in generative flow networks (GFlowNets), a class of probabilistic models that sample objects with probability proportional to an unnormalized reward density, we propose a novel GFlowNet method dubbed Nabla-GFlowNet (abbreviated as ∇-GFlowNet), together with an objective called ∇-DB, plus its variant residual ∇-DB for finetuning pretrained diffusion models. These objectives leverage the rich signal in reward gradients for diversity-aware finetuning. We empirically show that our proposed residual ∇-DB achieves fast yet diversity- & prior-preserving finetuning of Stable Diffusion, a large-scale text-conditioned image diffusion model, on different realistic reward functions.

133Constrained Diffusion Implicit Models

[openreview] [pdf]

Abstract This paper describes an efficient algorithm for solving noisy linear inverse problems using pretrained diffusion models. Extending the paradigm of denoising diffusion implicit models (DDIM), we propose conditional diffusion implicit models (CDIM) that modify the diffusion updates to enforce a constraint upon the final output. For noiseless inverse problems, CDIM exactly satisfies the constraints; in the noisy case, we generalize CDIM to satisfy an exact constraint on the residual distribution of the noise. Experiments across a variety of tasks and metrics show strong performance of CDIM, with analogous inference acceleration to unconditional DDIM: 10 to 50 times faster than previous conditional diffusion methods. We demonstrate the versatility of our approach on many problems including super-resolution, denoising, inpainting, deblurring, and 3D point cloud reconstruction.
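
To make the constrained-update idea concrete, here is a hedged sketch of a single deterministic DDIM-style step with a data-consistency correction for a linear observation y = Ax; the simple gradient step on the x0-estimate is illustrative, not the paper's exact constraint enforcement.

```python
import numpy as np

def constrained_ddim_step(x_t, eps_pred, alpha_t, alpha_prev, A, y, step=1.0):
    """One DDIM update with a hypothetical data-consistency correction.

    alpha_t / alpha_prev are cumulative noise-schedule products; eps_pred is
    the denoiser's noise prediction at the current timestep.
    """
    # Standard DDIM x0-prediction from the noisy sample.
    x0 = (x_t - np.sqrt(1.0 - alpha_t) * eps_pred) / np.sqrt(alpha_t)
    # Pull the clean estimate toward the measurement constraint A x0 = y.
    x0 = x0 - step * A.T @ (A @ x0 - y)
    # Deterministic DDIM transition to the previous timestep.
    return np.sqrt(alpha_prev) * x0 + np.sqrt(1.0 - alpha_prev) * eps_pred
```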

134Uncertainty Prioritized Experience Replay

[openreview] [pdf]

Abstract Prioritized experience replay, which improves sample efficiency by selecting relevant transitions to update parameter estimates, is a crucial component of contemporary deep reinforcement learning models. Typically, transitions are prioritized based on their temporal difference error. However, this approach is prone to favoring noisy transitions, even when the value estimation closely approximates the target mean. This phenomenon resembles the noisy TV problem postulated in the exploration literature, in which exploration-guided agents get stuck by mistaking noise for novelty. To mitigate the disruptive effects of noise in value estimation, we propose using epistemic uncertainty to guide the prioritization of transitions from the replay buffer. Epistemic uncertainty quantifies the uncertainty that can be reduced by learning, hence reducing the sampling of buffer transitions generated by unpredictable random processes. We first illustrate the benefits of epistemic uncertainty prioritized replay in two tabular toy models: a simple multi-arm bandit task, and a noisy gridworld. Subsequently, we evaluate our prioritization scheme on the Atari suite, outperforming quantile regression deep Q-learning benchmarks; thus forging a path for the use of epistemic uncertainty prioritized replay in reinforcement learning agents.
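
A minimal sketch of how such priorities might be computed, assuming disagreement across a Q-ensemble as the epistemic-uncertainty proxy (the paper's estimator and sampling scheme may differ):

```python
import numpy as np

def epistemic_priorities(q_ensemble, states, actions):
    """Replay priorities from ensemble disagreement (illustrative).

    q_ensemble: callables, each mapping (states, actions) -> Q-value array.
    High disagreement indicates reducible (epistemic) uncertainty, so noisy
    but already well-estimated transitions are not over-sampled.
    """
    qs = np.stack([q(states, actions) for q in q_ensemble])  # (members, batch)
    return qs.std(axis=0)

def sample_replay_indices(priorities, batch_size, alpha=0.6, rng=None):
    rng = rng or np.random.default_rng()
    p = priorities ** alpha
    return rng.choice(len(priorities), size=batch_size, p=p / p.sum())
```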

135Replay can provably increase forgetting

[openreview] [pdf]

Abstract Continual learning seeks to enable machine learning systems to solve an increasing corpus of tasks sequentially. A critical challenge for continual learning is forgetting, where the performance on previously learned tasks decreases as new tasks are introduced. One of the commonly used techniques to mitigate forgetting, sample replay, has been shown empirically to reduce forgetting by retaining some examples from old tasks and including them in new training episodes. In this work, we provide a theoretical analysis of sample replay in an over-parameterized continual linear regression setting, where given enough replay samples, one would be able to eliminate forgetting. Our analysis focuses on replaying a few examples and highlights the role of the replay samples and task subspaces. Surprisingly, we find that forgetting can be non-monotonic with respect to the number of replay samples. We construct tasks where replay of a single example can increase forgetting and even distributions where replay of a randomly selected sample increases forgetting on average. We provide empirical evidence that this is a property of the tasks rather than the model used to train on them, by showing a similar behavior for a neural net equipped with SGD. Through experiments on a commonly used benchmark, we provide additional evidence that performance of the replay heavily depends on the choice of replay samples and the relationship between tasks.

136PFDiff: Training-free Acceleration of Diffusion Models through the Gradient Guidance of Past and Future

[openreview] [pdf]

Abstract Diffusion Probabilistic Models (DPMs) have shown remarkable potential in image generation, but their sampling efficiency is hindered by the need for numerous denoising steps. Most existing solutions accelerate the sampling process by proposing fast ODE solvers. However, the inevitable discretization errors of the ODE solvers are significantly magnified when the number of function evaluations (NFE) is small. In this work, we propose PFDiff, a novel training-free and orthogonal timestep-skipping strategy, which enables existing fast ODE solvers to operate with fewer NFE. Specifically, PFDiff initially utilizes gradient replacement from past time steps to predict a “springboard”. Subsequently, it employs this “springboard” along with foresight updates inspired by Nesterov momentum to rapidly update current intermediate states. This approach effectively reduces unnecessary NFE while correcting for discretization errors inherent in first-order ODE solvers. Experimental results demonstrate that PFDiff exhibits flexible applicability across various pre-trained DPMs, particularly excelling in conditional DPMs and surpassing previous state-of-the-art training-free methods. For instance, using DDIM as a baseline, we achieved 16.46 FID (4 NFE) compared to 138.81 FID with DDIM on ImageNet 64x64 with classifier guidance, and 13.06 FID (10 NFE) on Stable Diffusion with 7.5 guidance scale.

137Efficient Discovery of Pareto Front for Multi-Objective Reinforcement Learning

[openreview] [pdf]

Abstract Multi-objective reinforcement learning (MORL) excels at handling rapidly changing preferences in tasks that involve multiple criteria, even for unseen preferences. However, previously dominant MORL methods typically generate a fixed policy set or a preference-conditioned policy through multiple training iterations exclusively for sampled preference vectors, and cannot ensure the efficient discovery of the Pareto front. Furthermore, integrating preferences into the input of policy or value functions presents scalability challenges, in particular as the dimensions of the state and preference spaces grow, which can complicate the learning process and hinder the algorithm’s performance on more complex tasks. To address these issues, we propose a two-stage Pareto front discovery algorithm called Constrained MORL (C-MORL), which serves as a seamless bridge between constrained policy optimization and MORL. Concretely, a set of policies is trained in parallel in the initialization stage, with each optimized towards its individual preference over the multiple objectives. Then, to fill the remaining vacancies in the Pareto front, the constrained optimization steps are employed to maximize one objective while constraining the other objectives to exceed a predefined threshold. Empirically, compared to recent advancements in MORL methods, our algorithm achieves more consistent and superior performances in terms of hypervolume, expected utility, and sparsity on both discrete and continuous control tasks, especially with numerous objectives (up to nine objectives in our experiments).

138A Study of Posterior Stability for Time-Series Latent Diffusion

[openreview] [pdf]

Abstract Latent diffusion has demonstrated promising results in image generation and permits efficient sampling. However, this framework might suffer from the problem of posterior collapse when applied to time series. In this paper, we first show that posterior collapse will reduce latent diffusion to a variational autoencoder (VAE), making it less expressive. This highlights the importance of addressing this issue. We then introduce a principled method: dependency measure, that quantifies the sensitivity of a recurrent decoder to input variables. Using this tool, we confirm that posterior collapse significantly affects time-series latent diffusion on real datasets, and a phenomenon termed dependency illusion is also discovered in the case of shuffled time series. Finally, building on our theoretical and empirical studies, we introduce a new framework that extends latent diffusion and has a stable posterior. Extensive experiments on multiple real time-series datasets show that our new framework is free from posterior collapse and significantly outperforms previous baselines in time series synthesis.

139Diffusion-PINN Sampler

[openreview] [pdf]

Abstract Recent success of diffusion models has inspired a surge of interest in developing sampling techniques using reverse diffusion processes. However, accurately estimating the drift term in the reverse stochastic differential equation (SDE) solely from the unnormalized target density poses significant challenges, hindering existing methods from achieving state-of-the-art performance. In this paper, we introduce the Diffusion-PINN Sampler (DPS), a novel diffusion-based sampling algorithm that estimates the drift term by solving the governing partial differential equation of the log-density of the underlying SDE marginals via physics-informed neural networks (PINN). We prove that the error of log-density approximation can be controlled by the PINN residual loss, enabling us to establish convergence guarantees of DPS. Experiments on a variety of sampling tasks demonstrate the effectiveness of our approach, particularly in accurately identifying mixing proportions when the target contains isolated components.

140LLM Pruning and Distillation in Practice

[openreview] [pdf]

Abstract Structured pruning with knowledge distillation is a potent combination for obtaining small language models (SLMs) with significantly fewer training tokens and compute resources compared to training from scratch. In this work, we investigate how this strategy can be effectively applied in instances where access to the original pretraining dataset is restricted. We introduce a new teacher correction phase before distillation which lets the teacher model adjust to our specific data distribution using a lightweight fine-tuning phase. We apply this strategy to compress the Mistral NeMo 12B and Llama 3.1 8B models to 8B and 4B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Aligner and further tested for instruction following, role-play, math, coding and function calling capabilities. This approach produces the state-of-the-art Mistral-NeMo-Compressed-8B model from Mistral NeMo 12B, and a compelling 4B model from Llama 3.1 8B.

141Constant Rate Schedule: Constant-Rate Distributional Change for Efficient Training and Sampling in Diffusion Models

[openreview] [pdf]

Abstract We propose a noise schedule that ensures a constant rate of change in the probability distribution of diffused data throughout the diffusion process. To obtain this noise schedule, we measure the rate of change in the probability distribution of the forward process and use it to determine the noise schedule before training diffusion models. The functional form of the noise schedule is automatically determined and tailored to each dataset and type of diffusion model. We evaluate the effectiveness of our noise schedule on unconditional and class-conditional image generation tasks using the LSUN (bedroom/church/cat/horse), ImageNet, and FFHQ datasets. Through extensive experiments, we confirmed that our noise schedule broadly improves the performance of the diffusion models regardless of the dataset, sampler, number of function evaluations, or type of diffusion model.

142Agential AI for integrated continual learning, deliberative behavior, and comprehensible models

[openreview] [pdf]

Abstract The contemporary machine learning paradigm excels in statistical data analysis, solving problems that classical AI could not. However, it faces key limitations, such as a lack of integration with planning, incomprehensible internal structures, and an inability to learn continually without erasing prior knowledge. We present an initial design for an AI system, Agential AI (AAI), which in principle operates independently of or on top of statistical methods and overcomes all these issues. AAI’s core is a learning method that models temporal dynamics with guarantees of completeness, minimality, and continual learning. It integrates this with a behavior algorithm that plans on a learned model and encapsulates high-level behavior patterns. Preliminary experiments on a simple abstract environment show AAI’s effectiveness and future potential.

143Moonwalk: Inverse-Forward Differentiation

[openreview] [pdf]

Abstract Backpropagation, while effective for gradient computation, falls short in addressing memory consumption, limiting scalability. This work explores forward-mode gradient computation as an alternative in invertible and right-invertible networks, showing its potential to reduce the memory footprint without substantial drawbacks. We introduce a novel technique based on a vector-inverse-Jacobian product that accelerates the computation of forward gradients while retaining the advantages of memory reduction and preserving the fidelity of true gradients. Our method, Moonwalk, has a time complexity linear in the depth of the network, unlike the quadratic time complexity of naïve forward-mode differentiation, and empirically reduces computation time by several orders of magnitude without allocating more memory. We further accelerate Moonwalk by combining it with reverse-mode differentiation to achieve time complexity comparable with backpropagation while maintaining a much smaller memory footprint. Finally, we showcase the robustness of our method across several architecture choices. Moonwalk is the first forward-based method to compute true gradients in invertible and right-invertible networks in computation time comparable to backpropagation and using significantly less memory.

144Commute Your Domains: Trajectory Optimality Criterion for Multi-Domain Learning

[openreview] [pdf]

Abstract In multi-domain learning, a single model is trained on diverse data domains to leverage shared knowledge and improve generalization. The order in which the data from these domains is used for training can significantly affect the model’s performance on each domain. However, this dependence is under-studied. In this paper, we investigate the influence of training order (or data mixing) in multi-domain learning using the concept of Lie bracket of gradient vector fields. By analyzing the infinitesimal effects of changing the training order, we identify regions in the parameter space where altering the order between two training domains can benefit the target loss. We validate the predictions of our theoretical framework on the influence of training order (or data mixing) both on a toy example and bilingual LLM pre-training.
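
For reference, the Lie bracket in question has the standard form below; for per-domain gradient fields it reduces to a difference of Hessian-vector products (a textbook identity, restated here rather than taken from the paper).

```latex
% Lie bracket of two vector fields f, g on parameter space:
\[
  [f, g](\theta) = J_g(\theta)\, f(\theta) - J_f(\theta)\, g(\theta).
\]
% With gradient-flow fields f = -\nabla L_A and g = -\nabla L_B for the two
% domains, this becomes a difference of Hessian-vector products:
\[
  [f, g](\theta) = \nabla^2 L_B(\theta)\, \nabla L_A(\theta)
                 - \nabla^2 L_A(\theta)\, \nabla L_B(\theta),
\]
% so the order of two infinitesimal domain updates matters exactly where
% this bracket is nonzero.
```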

145Last-Iterate Convergence of Smooth Regret Matching+ Variants in Learning Nash Equilibria

[openreview] [pdf]

Abstract Regret Matching+ (RM+) variants have been widely used to develop superhuman Poker AIs, yet few studies investigate their last-iterate convergence. Their last-iterate convergence has been demonstrated only for games with strong monotonicity or two-player zero-sum matrix games. A primary obstacle in proving the last-iterate convergence for these algorithms is that their feedback is not the loss gradient of the vanilla games. This deviation results in the absence of crucial properties, e.g., monotonicity or the weak Minty variation inequality (MVI), which are pivotal for establishing the last-iterate convergence. To address the absence of these properties, we propose a remarkably succinct yet novel proof paradigm that consists of: (i) recovering these key properties through the equivalence between RM+ and Online Mirror Descent (OMD), and (ii) measuring the distance to Nash equilibrium (NE) via the tangent residual to show that this distance is related to the distance between accumulated regrets. To show the practical applicability of our proof paradigm, we use it to prove the last-iterate convergence of two existing smooth RM+ variants, Smooth Extra-gradient RM+ (SExRM+) and Smooth Predictive RM+ (SPRM+). We show that they achieve last-iterate convergence in learning an NE of games satisfying monotonicity, a weaker condition than the one used in existing proofs for both variants. Then, inspired by our proof paradigm, we propose Smooth Optimistic Gradient RM+ (SOGRM+). We show that SOGRM+ achieves last-iterate convergence in learning an NE of games satisfying the weak MVI, the weakest condition in all known proofs for RM+ variants. The experimental results show that SOGRM+ significantly outperforms other algorithms.

146Guided Reinforcement Learning with Roll-Back

[openreview] [pdf]

Abstract Reinforcement learning-based solutions are increasingly being considered as strong alternatives to classical system controllers, despite their significant sample inefficiency when learning controller tasks from scratch. Many methods that address this issue use prior task knowledge to guide the agent’s learning, with several recent algorithms providing a guide policy that is sometimes chosen to execute actions instead of the learner policy. While this approach lends excellent flexibility as it allows the guide knowledge to be provided in any format, it can be challenging to decide when and for how long to use the guide agent. Current guide policy-based approaches typically choose a static guide sampling rate empirically, and do not vary it. Approaches that transfer control use simple methods like linear decay, or require hyperparameter choices that strongly impact the performance. We show that under certain assumptions, the sampling rate of the guide policy can be calculated to guarantee that the mean return of the learning policy will surpass a user-defined performance degradation threshold. To the best of our knowledge, this is the first time a performance guarantee has been established for a guided RL method. We then implement a guided RL (GRL) algorithm that can make use of this sample rate, and additionally introduce a roll-back feature in guided RL with roll-back (GRL-RB) to adaptively balance the trade-off between performance degradation and rapid transfer of control to the learner. Our approach is simple to implement on top of existing algorithms, robust to hyperparameter choices, and effective in warm-starting online learning.
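
The control-mixing mechanism being tuned can be sketched as follows; a constant guide_rate stands in for the derived, guarantee-preserving sampling rate that is the paper's actual contribution.

```python
import numpy as np

def mixed_action(state, guide_policy, learner_policy, guide_rate, rng=None):
    """Execute the guide's action with probability guide_rate (illustrative).

    In the paper the rate is computed from a performance-degradation bound
    and adapted with roll-back; a fixed value is used here for clarity.
    """
    rng = rng or np.random.default_rng()
    chosen = guide_policy if rng.random() < guide_rate else learner_policy
    return chosen(state)
```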

147On Rollouts in Model-Based Reinforcement Learning

[openreview] [pdf]

Abstract Model-based reinforcement learning (MBRL) seeks to enhance data efficiency by learning a model of the environment and generating synthetic rollouts from it. However, accumulated model errors during these rollouts can distort the data distribution, negatively impacting policy learning and hindering long-term planning. Thus, the accumulation of model errors is a key bottleneck in current MBRL methods. We propose Infoprop, a model-based rollout mechanism that separates aleatoric from epistemic model uncertainty and reduces the influence of the latter on the data distribution. Further, Infoprop keeps track of accumulated model errors along a model rollout and provides termination criteria to limit data corruption. We demonstrate the capabilities of Infoprop in the Infoprop-Dyna algorithm, reporting state-of-the-art performance in Dyna-style MBRL on common MuJoCo benchmark tasks while substantially increasing rollout length and data quality.

148Multi-Student Diffusion Distillation for Better One-Step Generators

[openreview] [pdf]

Abstract Diffusion models achieve high-quality sample generation at the cost of a lengthy multistep inference procedure. To overcome this, diffusion distillation techniques produce student generators capable of matching or surpassing the teacher in a single step. However, the student model’s inference speed is limited by the size of the teacher architecture, preventing real-time generation for computationally heavy applications. In this work, we introduce Multi-Student Distillation (MSD), a framework to distill a conditional teacher diffusion model into multiple single-step generators. Each student generator is responsible for a subset of possible conditioning data, thereby obtaining higher generation quality for the same capacity. MSD trains multiple distilled students, allowing smaller sizes and, therefore, faster inference. Also, MSD offers a lightweight quality boost over single-student distillation with the same architecture. We demonstrate MSD is effective by training multiple same-sized or smaller students on single-step distillation using distribution matching and adversarial distillation techniques. With smaller students, MSD obtains competitive results with a faster inference time for single-step generation. Using same-sized students, MSD with 4 students sets new state-of-the-art results for one-step image generation: FID 1.20 on ImageNet-64×64 and 8.20 on zero-shot COCO2014.

149The Superposition of Diffusion Models

[openreview] [pdf]

Abstract The undeniable success of deep generative models for learning complex and high-dimensional data distributions has led to the proliferation of large-scale diffusion models across the entire machine-learning application spectrum. This Cambrian explosion of easily accessible pre-trained models, including fine-tuned open-source models on user-specific data, suggests a demand for methods that combine multiple different pre-trained models without incurring the significant computational burden of re-training a larger combined model. In this paper, we cast the problem of combining multiple pre-trained diffusion models at the generation stage under a novel proposed framework termed superposition. Theoretically, we derive superposition from rigorous first principles stemming from the celebrated continuity equation and design two novel algorithms tailor-made for combining diffusion models in SuperDiff. We demonstrate that SuperDiff is scalable to large pre-trained diffusion models as superposition is performed solely through composition during inference, and also enjoys painless implementation as it combines different pre-trained vector fields through an automated re-weighting scheme. Notably, we show that SuperDiff is efficient during inference time, and mimics traditional composition operators such as the logical OR and the logical AND. We empirically demonstrate the utility of using SuperDiff for generating more diverse images on CIFAR-10, more faithful prompt-conditioned image editing using Stable Diffusion, and improved unconditional de novo structure design of proteins.
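
A minimal sketch of the inference-time combination under simplifying assumptions: fixed convex weights stand in for the paper's automated re-weighting scheme, and each entry of drifts is a pre-trained model's vector field evaluated at the current state and time.

```python
import numpy as np

def superposed_drift(x, t, drifts, weights):
    """Convex combination of pre-trained diffusion vector fields (illustrative).

    drifts: callables (x, t) -> drift array, one per pre-trained model.
    Equal weights roughly mimic sampling from the mixture of the models
    (an OR-like composition); the paper's re-weighting is automated.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * d(x, t) for wi, d in zip(w, drifts))
```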

150Expected Return Symmetries

[openreview] [pdf]

Abstract Symmetry is an important inductive bias that can improve model robustness and generalization across many deep learning domains. In multi-agent settings, a priori known symmetries have been shown to address a fundamental coordination failure mode known as mutually incompatible symmetry breaking; e.g. in a game where two independent agents can choose to move “left” or “right”, and where a reward of +1 or -1 is received when the agents choose the same action or different actions, respectively. However, the efficient and automatic discovery of environment symmetries, in particular for decentralized partially observable Markov decision processes, remains an open problem. Furthermore, environmental symmetry breaking constitutes only one type of coordination failure, which motivates the search for a more accessible and broader symmetry class. In this paper, we introduce such a broader group of previously unexplored symmetries, which we call expected return symmetries, which contains environment symmetries as a subgroup. We show that agents trained to be compatible under the group of expected return symmetries achieve better zero-shot coordination results than those using environment symmetries. As an additional benefit, our method makes minimal a priori assumptions about the structure of their environment and does not require access to ground truth symmetries.

151Avoiding mode collapse in diffusion models fine-tuned with reinforcement learning

[openreview] [pdf]

Abstract Fine-tuning foundation models via reinforcement learning (RL) has proven promising for aligning to downstream objectives. In the case of diffusion models (DMs), though RL training improves alignment from early timesteps, critical issues such as training instability and mode collapse arise. We address these drawbacks by exploiting the hierarchical nature of DMs: we train them dynamically at each epoch with a tailored RL method, allowing for continual evaluation and step-by-step refinement of the model performance (or alignment). Furthermore, we find that not every denoising step needs to be fine-tuned to align DMs to downstream tasks. Consequently, in addition to clipping, we regularise model parameters at distinct learning phases via a sliding-window approach. Our approach, termed Hierarchical Reward Fine-tuning (HRF), is validated on the Denoising Diffusion Policy Optimisation method, where we show that models trained with HRF achieve better preservation of diversity in downstream tasks, thus enhancing fine-tuning robustness without compromising mean rewards.

152Latent Diffusion Planning for Imitation Learning

[openreview] [pdf]

Abstract Recent progress in robotic imitation learning has been enabled by policy architectures that scale to complex visuomotor tasks, multimodal distributions, and large datasets. However, these methods rely on supervised learning of actions from expert demonstrations, which can be challenging to scale. We propose Latent Diffusion Planning (LDP), which forecasts future states as well as actions via diffusion. This objective can scalably leverage heterogeneous data sources and provides a denser supervision signal for learning. To plan over images, we learn a compact latent space through a variational autoencoder. We then train a planner to forecast future latent states, and an inverse dynamics model to extract actions from the plans. As planning is separated from action prediction, LDP can leverage suboptimal or action-free data to improve performance in low demonstration regimes. On simulated visual robotic manipulation tasks, LDP outperforms state-of-the-art imitation learning approaches, as they cannot leverage such additional data.

153DuRND: Rewarding from Novelty to Contribution for Reinforcement Learning via Dual Random Networks Distillation

[openreview] [pdf]

Abstract Existing reward shaping techniques for sparse-reward tasks in reinforcement learning generally fall into two categories: novelty-based exploration bonuses and value-based rewards. The former encourages agents to explore less visited areas but can divert them from their main objectives, while the latter promotes stable late-stage convergence but often lacks sufficient early exploration. To combine the benefits of both, we propose Dual Random Networks Distillation (DuRND), a novel framework integrating two lightweight random network modules. These modules jointly generate two rewards: a novelty reward to drive exploration and a contribution reward to evaluate progress toward desired behaviors, achieving an efficient balance between exploration and exploitation. With low computational overhead, DuRND excels in high-dimensional environments like Atari, VizDoom, and MiniWorld, outperforming several benchmarks.
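
The random-network-distillation building block that the two modules presumably share can be sketched as below; how DuRND turns a pair of such modules into separate novelty and contribution rewards is the paper's contribution and is not reproduced here.

```python
import torch
import torch.nn as nn

def make_rnd(obs_dim, feat_dim=64):
    """A standard RND pair: frozen random target + trainable predictor."""
    target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                           nn.Linear(128, feat_dim))
    for p in target.parameters():
        p.requires_grad_(False)  # the target stays random and fixed
    predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                              nn.Linear(128, feat_dim))
    return target, predictor

def rnd_bonus(target, predictor, obs):
    """Per-state bonus: the predictor's error against the frozen target.

    The same squared error also serves as the predictor's training loss.
    """
    with torch.no_grad():
        t = target(obs)
    return ((predictor(obs) - t) ** 2).mean(dim=-1)
```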

154Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

[openreview] [pdf]

Abstract Predictable behavior from scaling advanced AI systems is an extremely desirable property for engineers, companies, economists and governments alike, and while a well-established literature exists on how pretraining performance scales, predictable scaling behavior on downstream capabilities remains elusive. While many factors are certainly responsible, this paper shines a light on a significant factor that makes predicting scaling behavior on widely used multiple-choice question answering benchmarks challenging and illuminates a path towards making such downstream evaluations predictable with scale. Using five model families and twelve well-established multiple-choice benchmarks, we show that downstream performance is computed from negative log likelihoods via a sequence of transformations that progressively degrades the statistical relationship between performance and scale. We then reveal the mechanism causing this degradation: downstream metrics require comparing the correct choice against a small number of specific incorrect choices, meaning accurately predicting downstream capabilities requires predicting not just how probability mass concentrates on the correct choice with scale, but also how probability mass fluctuates on specific incorrect choices with scale. We empirically study how probability mass on the correct choice co-varies with probability mass on incorrect choices with increasing compute, suggesting that scaling laws for \textit{incorrect} choices might be achievable. Our work also explains why pretraining scaling laws are commonly regarded as more predictable than downstream capabilities and contributes towards establishing scaling-predictable evaluations of frontier AI models.

155State Combinatorial Generalization In Decision Making With Conditional Diffusion Models

[openreview] [pdf]

Abstract Many real-world decision-making problems are combinatorial in nature, where states (e.g., surrounding traffic of a self-driving car) can be seen as a combination of basic elements (e.g., pedestrians, trees, and other cars). Due to combinatorial complexity, observing all combinations of basic elements in the training set is infeasible, which leads to an essential yet understudied problem of zero-shot generalization to states that are unseen combinations of previously seen elements. In this work, we first formalize this problem and then demonstrate how existing value-based reinforcement learning (RL) algorithms struggle due to unreliable value predictions in unseen states. We argue that this problem cannot be addressed with exploration alone, but requires more expressive and generalizable models. We demonstrate that behavior cloning with a conditioned diffusion model trained on expert trajectories generalizes better to states formed by new combinations of seen elements than traditional RL methods. Through experiments in maze, driving, and multiagent environments, we show that conditioned diffusion models outperform traditional RL techniques and highlight the broad applicability of our problem formulation.

156Scaling Concept With Text-Guided Diffusion Models

[openreview] [pdf]

Abstract Text-guided diffusion models have revolutionized generative tasks by producing high-fidelity content based on text descriptions. Additionally, they have enabled an editing paradigm where concepts can be replaced through text conditioning. In this work, we explore a novel paradigm: instead of replacing a concept, can we scale it? We conduct an empirical study to investigate concept decomposition trends in text-guided diffusion models. Leveraging these insights, we propose a simple yet effective method, ScalingConcept, designed to enhance or suppress existing concepts in real input without introducing new ones. To systematically evaluate our method, we introduce the WeakConcept-10 dataset. More importantly, ScalingConcept enables a range of novel zero-shot applications across both image and audio domains, including but not limited to canonical pose generation and generative sound highlighting/removal.

157Distributional Sobolev reinforcement learning

[openreview] [pdf]

Abstract Distributional reinforcement learning (DRL) is a framework for learning a complete distribution over returns, rather than merely estimating expectations. In this paper, we further expand DRL by estimating a distribution over the gradient of the state-action value function, in addition to its scalar value. We refer to this method as Distributional Sobolev training. Inspired by Stochastic Value Gradients (SVG), we achieve this by leveraging a one-step world model of the reward and transition distributions implemented using a conditional Variational Autoencoder (cVAE). Our approach is sample-based and relies on Maximum Mean Discrepancy (MMD) to instantiate the distributional Bellman operator. We first showcase the method on a toy supervised learning problem. We then validate our algorithm in several Mujoco/Brax environments.
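
For concreteness, here is a standard (biased) sample estimator of squared MMD with an RBF kernel, i.e. the kind of discrepancy that could instantiate the distributional Bellman loss; the kernel and bandwidth choice are assumptions.

```python
import numpy as np

def rbf_mmd2(x, y, sigma=1.0):
    """Biased estimator of squared MMD between sample sets x, y of shape (n, d)."""
    def gram(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean()

rng = np.random.default_rng(0)
a, b = rng.normal(0.0, 1.0, (256, 2)), rng.normal(0.5, 1.0, (256, 2))
print(rbf_mmd2(a, b))  # larger when the two sample distributions differ
```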

158Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift

[openreview] [pdf]

Abstract Reinforcement learning from human feedback (RLHF) aligns Large Language Models (LLMs) with human preferences. However, these preferences can often change over time due to external factors (e.g. environment change and societal influence). Consequently, what was wrong then might be right now. Current preference optimization algorithms do not account for temporal preference drift in their modeling, which can lead to severe misalignment. To address this limitation, we use a Dynamic Bradley-Terry model that models preferences via time-dependent reward functions, and propose Non-Stationary Direct Preference Optimisation (NS-DPO). By introducing a discount parameter in the loss function, NS-DPO applies exponential weighting, which proportionally focuses learning on more time-relevant datapoints. We theoretically analyse the convergence of NS-DPO in the offline setting, providing upper bounds on the estimation error caused by non-stationary preferences. Finally, we demonstrate the effectiveness of NS-DPO for fine-tuning LLMs in scenarios with drifting preferences. By simulating preference drift using renowned reward models and modifying popular LLM datasets accordingly, we show that NS-DPO fine-tuned LLMs remain robust under non-stationarity, significantly outperforming baseline algorithms that ignore temporal preference changes, without sacrificing performance in stationary cases.
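
One plausible form of the discounted objective, with an exponential weight γ^(T−t_i) applied on top of the standard DPO loss (notation is assumed here, not quoted from the paper):

```latex
% NS-DPO-style loss: each preference pair i, collected at time t_i, is
% down-weighted by \gamma^{T - t_i} relative to the current time T.
\[
  \mathcal{L}(\theta) = -\sum_i \gamma^{\,T - t_i}
    \log \sigma\!\Bigl(
      \beta \log \tfrac{\pi_\theta(y_w^i \mid x^i)}{\pi_{\mathrm{ref}}(y_w^i \mid x^i)}
      - \beta \log \tfrac{\pi_\theta(y_l^i \mid x^i)}{\pi_{\mathrm{ref}}(y_l^i \mid x^i)}
    \Bigr), \qquad 0 < \gamma < 1 .
\]
```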

159On the Generalization of Preference Learning with DPO

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated remarkable capabilities but often struggle to align with human preferences, leading to harmful or undesirable outputs. Preference learning, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a crucial component for ensuring that LLMs align with human values. Despite the widespread adoption in real-world systems, a thorough theoretical understanding of the generalization guarantees for these models remains lacking. This paper bridges that gap by introducing a new theoretical framework to analyze the generalization guarantees of models trained with direct preference optimization. While existing generalization theory often focuses on overparameterized models achieving near-optimal loss or models independent of the training process, our framework rigorously assesses how well models generalize after a finite number of gradient steps, reflecting real-world LLM training practices. By analyzing the reward margin associated with each sample and its trajectory throughout training, we can effectively bound the generalization error. We derive learning guarantees showing that, under specific conditions, models trained with DPO can correctly discern preferred responses on unseen data with high probability. These insights are empirically validated on contemporary LLMs, underscoring the practical relevance of our theory.

160Dynamical Diffusion: Learning Temporal Dynamics with Diffusion Models

[openreview] [pdf]

Abstract Diffusion models have emerged as powerful generative frameworks by progressively adding noise to data through a forward process and then reversing this process to generate realistic samples. While these models have achieved strong performance across various tasks and modalities, their application to temporal predictive learning remains underexplored. Existing approaches treat predictive learning as a conditional generation problem, but often fail to fully exploit the temporal dynamics inherent in the data, leading to challenges in generating temporally coherent sequences. To address this, we introduce Dynamical Diffusion (DyDiff), a theoretically sound framework that incorporates temporally aware forward and reverse processes. Dynamical Diffusion explicitly models temporal transitions at each diffusion step, establishing dependencies on preceding states to better capture temporal dynamics. Through the reparameterization trick, Dynamical Diffusion achieves efficient training and inference similar to any standard diffusion model. Extensive experiments across scientific spatiotemporal forecasting, video prediction, and time series forecasting demonstrate that Dynamical Diffusion consistently improves performance in temporal predictive tasks, filling a crucial gap in existing methodologies.

161Rethinking Knowledge Distillation: A Mixture-of-Experts Perspective

[openreview] [pdf]

Abstract Knowledge distillation (KD) aims to transfer useful information from a large-scale model (teacher) to a lightweight model (student). Classical KD focuses on leveraging the teacher’s predictions as soft labels to regularize student training. However, the exact match of predictions in Kullback-Leibler (KL) divergence could be somewhat in conflict with the classification objective, given that the distribution discrepancies between teacher-generated predictions and ground-truth annotations tend to be fairly severe. In this paper, we rethink the role of teacher predictions from a Mixture-of-Experts (MoE) perspective and transfer knowledge by introducing teacher predictions as latent variables to reformulate the classification objective. This MoE strategy results in breaking down the vanilla classification task into a mixture of easier subtasks with the teacher classifier as a gating function to weigh the importance of subtasks. Each subtask is efficiently conquered by distinct experts that are effectively implemented by resorting to multi-level teacher outputs. We further develop a theoretical framework to formulate our method, termed MoE-KD, as an Expectation-Maximization (EM) algorithm and provide proof of the convergence. Extensive experiments manifest that MoE-KD outperforms advanced knowledge distillers on mainstream benchmarks.

162A Distributional Approach to Uncertainty-Aware Preference Alignment Using Offline Demonstrations

[openreview] [pdf]

Abstract Designing reward functions in Reinforcement Learning (RL) often demands significant task-specific expertise. Offline preference-based Reinforcement Learning (PbRL) provides an effective alternative to address the complexity of reward design by learning policies from offline datasets that contain human preferences between trajectory pairs. Existing offline PbRL studies typically model a reward function by maximizing its likelihood of generating the observed human preferences. However, due to the varying number of samples within the limited dataset, less frequently compared trajectories exhibit greater uncertainty, which potentially leads to unreliable behaviors during reward and policy updates. To solve this issue, in this work, we introduce Uncertainty-Aware PbRL (UA-PbRL) to learn a distributional reward model and a risk-sensitive policy from an offline preference dataset. Our approach employs a Maximum A Posteriori (MAP) objective to update trajectory rewards and incorporates an informative prior to account for the uncertainties. Building upon this reward update, we propose a generative reward model to capture the reward distribution, utilizing the offline distributional Bellman operator and the Conditional Value-at-Risk (CVaR) metric to train a risk-sensitive policy. Experimental results demonstrate that UA-PbRL effectively identifies and avoids states with high uncertainty, facilitating risk-averse behaviors across various tasks, including robot control and language model alignment.

163Counterfactual History Distillation on Continuous-time Event Sequences

[openreview] [pdf]

Abstract This study aims to distill history events that have essential information for predicting subsequent events with counterfactual analysis. The problem is named Counterfactual History Distillation (CHD). CHD distills a minimum set of events from history, based on which the distribution provided by a trained marked temporal point process (MTPP) model fits the events observed later, while the distribution based on the remaining events in history does not. It can help understand which event marks may have more influence on the occurrence of future events and which events in history may have a causal relationship with the events observed later. This study proposes a robust solution for CHD, called MTPP-based Counterfactual History Distiller (MTPP-CHD). MTPP-CHD learns to select the optimal event combination from history for the events observed later. Experiment results demonstrate the superiority of MTPP-CHD by outperforming baselines in terms of distillation quality and processing speed.

164Orient Anything

[openreview] [pdf]

Abstract Orientation estimation is a fundamental task in 3D shape analysis which consists of estimating a shape’s orientation axes: its side-, up-, and front-axes. Using this data, one can rotate a shape into canonical orientation, where its orientation axes are aligned with the coordinate axes. Developing an orientation algorithm that reliably estimates complete orientations of general shapes remains an open problem. We introduce a two-stage orientation pipeline that achieves state-of-the-art performance on up-axis estimation and further demonstrate its efficacy on full-orientation estimation, where one seeks all three orientation axes. Unlike previous work, we train and evaluate our method on all of ShapeNet rather than a subset of classes. We motivate our engineering contributions by theory describing fundamental obstacles to orientation estimation for rotationally-symmetric shapes, and show how our method avoids these obstacles.

165Diversifying Spurious Subgraphs for Graph Out-of-Distribution Generalization

[openreview] [pdf]

Abstract Environment augmentation methods have gained some success in overcoming the out-of-distribution (OOD) generalization challenge in Graph Neural Networks (GNNs). Yet, there exists a challenging trade-off in the augmentation: On one hand, it requires the generated graphs to be as diverse as possible to extrapolate to unseen environments. On the other hand, it requires the generated graphs to preserve the invariant substructures causally related to the targets. Existing approaches have proposed various environment augmentation strategies to enrich spurious patterns for OOD generalization. However, we argue that these methods remain limited in diversity and precision of the generated environments for two reasons: i) the deterministic nature of the graph composition strategy used for environment augmentation may limit the diversity of the generated environments, and ii) the presence of spurious correlations may lead to the exclusion of invariant subgraphs and reduce the precision of the generated environments. To address this trade-off, we propose a novel paradigm that accurately identifies spurious subgraphs, and an environment augmentation strategy called spurious subgraph diversification, which extrapolates to maximally diversified spurious subgraphs by randomizing the spurious subgraph generation, while preserving the invariant substructures. Our method is theoretically sound and demonstrates strong empirical performance on both synthetic and real-world datasets, outperforming the second-best method by up to 24.19% across 17 baseline methods, underscoring its superiority in graph OOD generalization.

166Joint Reward and Policy Learning with Demonstrations and Human Feedback Improves Alignment

[openreview] [pdf]

Abstract Aligning to human preferences and/or intentions is an important requirement for contemporary foundation models. To ensure alignment, popular approaches such as reinforcement learning with human feedback (RLHF) break down the task into three stages: (i) a model is computed with supervised fine-tuning (SFT) based upon large demonstration data, (ii) a reward model (RM) is estimated based upon human feedback data, and (iii) reinforcement learning (RL) is used to further refine the SFT model by optimizing the estimated reward model. Typically, the number of parameters in the reward model greatly exceeds the number of preference observations in the human feedback data. As a result, the reward model is likely inaccurate and the resulting policy model (fine-tuned with RL) may exhibit poor alignment performance. In this paper, we introduce a new approach, AIHF, in which reward and policy models are jointly trained by simultaneously leveraging demonstration and human feedback data. We introduce a tractable algorithm for finding the AIHF reward and policy models and provide a finite time performance guarantee. Additionally, we demonstrate the efficiency of the proposed solution with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo. We observe that the proposed solutions outperform the existing alignment algorithms such as RLHF and DPO by large margins, especially when the data is unbalanced.

167Training-Free Diffusion Model Alignment with Sampling Demons

[openreview] [pdf]

Abstract Aligning diffusion models with user preferences has been a key challenge. Existing methods for aligning diffusion models either require retraining or are limited to differentiable reward functions. To address these limitations, we propose a stochastic optimization approach, dubbed Demon, to guide the denoising process at inference time without backpropagation through reward functions or model retraining. Our approach works by controlling the noise distribution in denoising steps to concentrate density on regions corresponding to high rewards through stochastic optimization. We provide comprehensive theoretical and empirical evidence to support and validate our approach, including experiments that use non-differentiable sources of rewards such as Visual-Language Model (VLM) APIs and human judgments. To the best of our knowledge, the proposed approach is the first inference-time, backpropagation-free preference alignment method for diffusion models. Our method can be easily integrated with existing diffusion models without further training. Our experiments show that the proposed approach significantly improves the average aesthetics scores for text-to-image generation.
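A hedged sketch of the core loop such a backpropagation-free method implies: sample several noise candidates per denoising step and keep the transition whose predicted clean image scores highest under a reward that may be a black box. Here `denoise_step`, `predict_x0`, and `reward_fn` are placeholders, not the paper's API.

```python
import torch

@torch.no_grad()
def demon_like_step(x_t, t, denoise_step, predict_x0, reward_fn, n_candidates=8):
    """Hypothetical inference-time step in the spirit of Demon: try several
    noise candidates, score the implied clean images with a (possibly
    non-differentiable) reward, and keep the highest-reward transition."""
    best_x, best_r = None, -float("inf")
    for _ in range(n_candidates):
        z = torch.randn_like(x_t)
        x_next = denoise_step(x_t, t, z)        # one reverse-diffusion step
        r = reward_fn(predict_x0(x_next, t))    # score via the predicted x0
        if r > best_r:
            best_x, best_r = x_next, r
    return best_x
```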

168DDIL: Improved Diffusion Distillation with Imitation Learning

[openreview] [pdf]

Abstract Diffusion models excel at generative modeling (e.g., text-to-image), but sampling requires multiple passes through the denoising network, limiting practicality. Diffusion distillation methods have shown promise by reducing the number of passes, albeit at the expense of the diversity and quality of the generated samples. In this work we identify covariate shift, arising from compounding error at inference time, as one reason for the poor performance of multi-step distilled models. To address covariate shift, we formulate diffusion distillation within an imitation learning (DDIL) framework and enhance the training distribution for distilling diffusion models on both the data distribution (forward diffusion) and student-induced distributions (backward diffusion). Training on the data distribution helps to diversify the generations by preserving the marginal data distribution, while training on the student distribution addresses compounding error by correcting covariate shift. In addition, we adopt a reflected diffusion formulation for distillation and demonstrate improved performance and stable training across different distillation methods. We show that DDIL and the reflected diffusion formulation consistently improve on the baseline algorithms of progressive distillation (PD), latent consistency models (LCM), and Distribution Matching Distillation (DMD2).

169Medium-Difficulty Samples Constitute Smoothed Decision Boundary for Knowledge Distillation on Pruned Datasets

[openreview] [pdf]

Abstract This paper tackles a new problem of dataset pruning for Knowledge Distillation (KD), from a fresh perspective of Decision Boundary (DB) preservation and drifts. Existing dataset pruning methods generally assume that the post-pruning DB formed by the selected samples can be well-captured by future networks that use those samples for training. Therefore, they tend to preserve hard samples since hard samples are closer to the DB and better characterize the nuances in the distribution of the entire dataset. However, in KD, the limited learning capacity of the student network leads to imperfect preservation of the teacher’s feature distribution, resulting in the drift of DB in the student space. Specifically, hard samples worsen such drifts as they are difficult for the student to learn, creating a situation where the student’s DB can drift deeper into other classes and make incorrect classifications. Motivated by these findings, our method selects medium-difficulty samples for KD-based dataset pruning. We show that these samples constitute a smoothed version of the teacher’s DB and are easier for the student to learn, yielding a general feature distribution preservation for a class of samples and a reasonable DB between different classes for the student. In addition, to reduce the distributional shift due to dataset pruning, we leverage the class-wise distributional information of the teacher’s outputs to reshape the logits of the preserved samples. Experiments show that the proposed static pruning method can even perform better than the state-of-the-art dynamic pruning method, which needs access to the entire dataset. In addition, our method halves the training time of KD and improves the student’s accuracy by 0.4% on ImageNet with a 50% keep ratio. When the ratio further increases to 70%, our method achieves higher accuracy than vanilla KD while reducing the training time by 30%.
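As a toy illustration of the selection principle (not the authors' exact criterion), one could rank samples by the teacher's confidence on the true class and keep the middle band:

```python
import numpy as np

def select_medium_difficulty(teacher_probs, labels, keep_ratio=0.5):
    """Hypothetical pruning rule: rank samples by the teacher's confidence on
    the true class and keep the middle band, discarding both the easiest and
    hardest extremes as a proxy for 'medium difficulty'."""
    conf = teacher_probs[np.arange(len(labels)), labels]
    order = np.argsort(conf)                 # hardest -> easiest
    n_keep = int(keep_ratio * len(labels))
    start = (len(labels) - n_keep) // 2      # center the kept band
    return order[start:start + n_keep]       # indices of retained samples
```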

170RAGDP: Retrieve-Augmented Generative Diffusion Policy

[openreview] [pdf]

Abstract Diffusion Policy has attracted attention for its ability to achieve significant accuracy gains in a variety of imitation learning tasks. However, since Diffusion Policy relies on the Diffusion Model, it requires multiple denoising steps to generate a single action, leading to long generation times. To address this issue, methods like DDIM and Consistency Models have been introduced to speed up the process. While these methods reduce computation time, this often comes at the cost of accuracy. In this paper, we propose RAGDP, a technique designed to improve the efficiency of learned Diffusion Policies without sacrificing accuracy. RAGDP builds upon the Retrieval-Augmented Generation (RAG) technique, which is commonly used in large language models to store and retrieve data from a vector database based on encoded embeddings. In RAGDP, pairs of expert observations and actions are stored in a vector database. The system then searches the database using encoded observation data to retrieve expert action data with high similarity. This retrieved expert data is subsequently used by the RAGDP algorithm to generate actions tailored to the current environment. We introduce two action generation algorithms, RAGDP-VP and RAGDP-VE, which correspond to different types of Diffusion Models. Our results demonstrate that RAGDP can significantly improve the speed of Diffusion Policy without compromising accuracy. Furthermore, RAGDP can be integrated with existing speed-up methods to enhance their performance.
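A minimal sketch of the retrieval component, assuming cosine similarity over normalized observation embeddings; `ExpertMemory` is an illustrative stand-in, and a production system would use an approximate-nearest-neighbor index such as FAISS.

```python
import numpy as np

class ExpertMemory:
    """Minimal vector store for (observation embedding, expert action) pairs,
    an assumed stand-in for RAGDP's vector database."""
    def __init__(self, obs_embeddings, actions):
        norms = np.linalg.norm(obs_embeddings, axis=1, keepdims=True)
        self.keys = obs_embeddings / np.clip(norms, 1e-8, None)
        self.actions = actions

    def retrieve(self, query_embedding, k=5):
        """Return the k expert actions whose observations best match the query."""
        q = query_embedding / max(np.linalg.norm(query_embedding), 1e-8)
        sims = self.keys @ q                  # cosine similarity to all keys
        top = np.argsort(-sims)[:k]
        return self.actions[top], sims[top]
```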

171Provable Causal State Representation under Asynchronous Diffusion Model for POMDPs

[openreview] [pdf]

Abstract A major challenge in applying reinforcement learning (RL) to real-world scenarios is managing high-dimensional, noisy perception input signals. Identifying and utilizing representations that contain sufficient and essential information for decision-making tasks is key to computational efficiency and generalization of RL by reducing bias in decision-making processes. In this paper, we present a new RL framework, named Causal State Representation under Asynchronous Diffusion Model (CSR-ADM), which accommodates and enhances any RL algorithm for partially observable Markov decision processes (POMDPs) with perturbed inputs. A new asynchronous diffusion model is proposed to denoise both reward and observation spaces, and is integrated with bisimulation techniques to capture causal state representations in POMDPs. Notably, the causal state is the coarsest partition of the denoised observations. We link the causal state to a causal feature set and provide theoretical guarantees by deriving the upper bound on value function approximation between the noisy observation space and the causal state space, demonstrating equivalence to bisimulation under the Lipschitz assumption. To the best of our knowledge, CSR-ADM is the first framework to approximate causal states with diffusion models, substantiated by a comprehensive theoretical foundation. Extensive experiments on Roboschool tasks show that CSR-ADM outperforms state-of-the-art methods, significantly improving the robustness of existing RL algorithms under varying scales of random noise.

172Model predictive control is almost optimal for restless bandits

[openreview] [pdf]

Abstract We consider the discrete-time infinite-horizon average-reward restless Markovian bandit (RMAB) problem. We propose a model predictive control based non-stationary policy with a rolling computational horizon τ. At each time-slot, this policy solves a τ-horizon linear program whose first control value is kept as a control for the RMAB. Our solution requires minimal assumptions and quantifies the loss in optimality in terms of τ and the number of arms, $N$. We show that its sub-optimality gap is $O(1/\sqrt{N})$ in general, and $\exp(-\Omega(N))$ under a local-stability condition. Our proof is based on a framework from dynamic control known as dissipativity. Not only is our solution easy to implement, it also performs very well in practice when compared to the state of the art. Further, both our solution and our proof methodology can easily be generalized to more general constrained MDP settings and should thus be of great interest to the burgeoning RMAB community.

173Score Forgetting Distillation: A Swift, Data-Free Method for Machine Unlearning in Diffusion Models

[openreview] [pdf]

Abstract The machine learning community is increasingly recognizing the importance of fostering trust and safety in modern generative AI (GenAI) models. We posit machine unlearning (MU) as a crucial foundation for developing safe, secure, and trustworthy GenAI models. Traditional MU methods often rely on stringent assumptions and require access to real data. This paper introduces Score Forgetting Distillation (SFD), an innovative MU approach that promotes the forgetting of undesirable information in diffusion models by aligning the conditional scores of “unsafe” classes or concepts with those of “safe” ones. To eliminate the need for real data, our SFD framework incorporates a score-based MU loss into the score distillation objective of a pretrained diffusion model. This serves as a regularization term that preserves desired generation capabilities while enabling the production of synthetic data through a one-step generator. Our experiments on pretrained label-conditional and text-to-image diffusion models demonstrate that our method effectively accelerates the forgetting of target classes or concepts during generation, while preserving the quality of other classes or concepts. This unlearned and distilled diffusion model not only pioneers a novel concept in MU but also accelerates the generation speed of diffusion models. Our experiments and studies on a range of diffusion models and datasets confirm that our approach is generalizable, effective, and advantageous for MU in diffusion models.

174Generate explorative goals with large language model guidance

[openreview] [pdf]

Abstract Reinforcement learning (RL) struggles with sparse reward environments. Recent developments in intrinsic motivation have revealed the potential of language models to guide agents in exploring the environment. However, the mismatch between the granularity of environment transitions and natural language descriptions hinders effective exploration for current methods. To address this problem, we introduce a model-based RL method named Language-Guided Explorative Goal Generation (LanGoal), which combines large language model (LLM) guidance with intrinsic exploration reward by learning to propose meaningful goals. LanGoal learns a hierarchical policy together with a world model. The high-level policy learns to propose goals based on LLM guidance to explore the environment, and the low-level policy learns to achieve the goals. Extensive results on Crafter demonstrate the effectiveness of LanGoal compared to recent methods.

175Stability and Sharper Risk Bounds with Convergence Rate $O(1/n^2)$

[openreview] [pdf]

Abstract The sharpest known high-probability excess risk bounds are up to $O(1/n)$ for empirical risk minimization and projected gradient descent via algorithmic stability (Klochkov & Zhivotovskiy, 2021). In this paper, we show that high-probability excess risk bounds of order up to $O(1/n^2)$ are possible. We discuss how high-probability excess risk bounds reach $O(1/n^2)$ under strong convexity, smoothness, and Lipschitz continuity assumptions for empirical risk minimization, projected gradient descent, and stochastic gradient descent. Besides, to the best of our knowledge, our high-probability results on the generalization gap measured by gradients for nonconvex problems are also the sharpest.

176Learning Conditionally Independent Marginals Enables Logical Compositions in Conditional Diffusion Models

[openreview] [pdf]

Abstract How can we learn generative models to sample data with arbitrary logical compositions of statistically independent attributes? The prevailing solution is to sample from distributions expressed as a composition of attributes’ conditional marginal distributions under the assumption that they are statistically independent. This paper shows that standard conditional diffusion models violate this assumption, even when all attribute compositions are observed during training, and the violation is significantly more severe when only a subset of the compositions is observed. We propose CoInD to address this problem. It explicitly enforces statistical independence between the conditional marginal distributions by minimizing Fisher’s divergence between the joint and marginal distributions. The theoretical advantages of CoInD are reflected in both qualitative and quantitative experiments, demonstrating a significantly more faithful and controlled generation of samples for arbitrary logical compositions of attributes. The benefit is more pronounced for scenarios that current solutions relying on the assumption of conditionally independent marginals struggle with, namely, logical compositions involving the NOT operation and when only a subset of compositions are observed during training.

177Diffusion-Guided Safe Policy Optimization From Cost-Label-Free Offline Dataset

[openreview] [pdf]

Abstract Offline safe reinforcement learning (RL) aims to guarantee the safety of decision-making in both training and deployment phases by learning the safe policy entirely from offline data without further interaction with the environment, which pushes RL towards real-world applications. Previous efforts in offline safe RL typically presume the presence of Markovian costs within the dataset. However, the design of a Markovian cost function involves rehearsal of all potentially unsafe cases, which is inefficient and even unfeasible in many practical tasks. In this work, we take a further step forward by learning a safe policy from an offline dataset without any cost labels, but with a small number of safe demonstrations included. To solve this problem, we propose a two-stage optimization method called Diffusion-guided Safe Policy Optimization (DSPO). Initially, we derive trajectory-wise safety signals by training a return-agnostic discriminator. Subsequently, we train a conditional diffusion model that generates trajectories conditioned both on the trajectory return and the safety signal. Remarkably, the trajectories generated by our diffusion model not only yield high returns but also comply with the safety signals, from which we can derive a desirable policy through behavior cloning (BC). The evaluation experiments conducted across tasks from the SafetyGym, BulletGym, and MetaDrive environments demonstrate that our approach can achieve a safe policy with high returns, significantly outperforming various established baselines.

178KEA: Keeping Exploration Alive by Proactively Coordinating Exploration Strategies in Curiosity-driven Exploration

[openreview] [pdf]

Abstract In continuous control tasks, Soft Actor-Critic (SAC) has achieved notable success by balancing exploration and exploitation. However, SAC struggles in sparse reward environments, where infrequent rewards hinder efficient exploration. While curiosity-driven exploration methods help address this issue by encouraging the agent to explore novel states, they introduce challenges, such as the difficulty of setting an optimal reward scale and managing the interaction between curiosity-based exploration and SAC’s stochastic policy. These complexities often lead to inefficient exploration or premature convergence and make balancing exploration-exploitation challenging. In this paper, we propose KEA (Keeping Exploration Alive) to tackle the inefficiencies in balancing the exploration-exploitation trade-off when combining SAC with curiosity-based methods. KEA introduces an additional co-behavior agent that works alongside SAC and a switching mechanism to facilitate proactive coordination between exploration strategies from the co-behavior agent and the SAC agent with curiosity-based exploration. This coordination allows the agent to maintain stochasticity in high-novelty regions, preventing premature convergence and enhancing exploration efficiency. We first analyze the difficulty of balancing exploration-exploitation when combining SAC with curiosity-based methods in a 2D grid environment. We then evaluate KEA on sparse reward control tasks from the DeepMind Control Suite and compare against two state-of-the-art curiosity-based exploration baselines — Random Network Distillation (RND) and NovelD. KEA improves episodic rewards by up to 119% over RND and 28% over NovelD, significantly improving learning efficiency and robustness in sparse reward environments.

179Practical alignment requires more than learning from human feedback

[openreview] [pdf]

Abstract Ensuring the alignment of artificial intelligence (AI) systems with human objectives is a critical challenge in the development of safe and effective AI technologies. Reinforcement learning from human feedback (RLHF) has been a predominant method to tackle this challenge. However, this framework operates under the unrealistic assumptions that human preferences are accurate reflections of their desires and that they remain constant over time. This paper identifies and challenges these assumptions by illustrating how they can lead to undesirable consequences, particularly when human beliefs about the environment are incorrect or mutate over time. To address these challenges, we introduce a novel framework termed practical alignment. This framework redefines the alignment objective to accommodate the variability and irrationality of human beliefs, emphasizing the need for AI systems not only to learn from but also to teach humans about the world. We discuss the theoretical underpinnings of practical alignment and introduce MindGrid, a toolkit designed to simulate and evaluate alignment scenarios. Our experimental results using large language models in teaching scenarios underscore the importance of teaching skills as a requisite capability to achieve alignment.

180TerDiT: Ternary Diffusion Models with Transformers

[openreview] [pdf]

Abstract Recent developments in large-scale pre-trained text-to-image diffusion models have significantly improved the generation of high-fidelity images, particularly with the emergence of diffusion transformer models (DiTs). Among diffusion models, diffusion transformers have demonstrated superior image generation capabilities, boasting lower FID scores and higher scalability. However, deploying large-scale DiT models can be expensive due to their excessive parameter counts. Although existing research has explored efficient deployment techniques for diffusion models such as model quantization, there is still little work concerning DiT-based models. To tackle this research gap, in this paper, we propose TerDiT, a quantization-aware training (QAT) and efficient deployment scheme for ternary diffusion transformer models. We focus on the ternarization of DiT networks, with model sizes ranging from 600M to 4.2B, and image resolutions from 256×256 to 512×512. Our work contributes to the exploration of efficient deployment of large-scale DiT models, demonstrating the feasibility of training extremely low-bit DiT models from scratch while maintaining competitive image generation capacities compared to full-precision models. Code has been uploaded in the supplemental materials.

181UniCon: Unidirectional Information Flow for Effective Control of Large-Scale Diffusion Models

[openreview] [pdf]

Abstract We introduce UniCon, a novel architecture designed to enhance control and efficiency in training adapters for large-scale diffusion models like the Diffusion Transformer. Unlike existing methods that rely on bidirectional interaction between the diffusion model and control adapter, UniCon implements a unidirectional flow from the diffusion network to the adapter, allowing the adapter alone to generate the final output. UniCon reduces computational demands by eliminating the need for the diffusion model to compute and store gradients during adapter training. UniCon is free from the constraints of encoder-focused designs and is able to utilize all parameters of the diffusion model, making it highly effective for transformer-based architectures. Our results indicate that UniCon reduces GPU memory usage by one-third and increases training speed by 2.3 times, while maintaining the same adapter parameter size. Additionally, without requiring extra computational resources, UniCon enables the training of adapters with double the parameter volume of existing ControlNets. In a series of conditional image generation tasks, UniCon has demonstrated precise response to control information and excellent generation capabilities. UniCon makes the control of large-scale diffusion models feasible and provides a basis for further scaling up of diffusion models.

182Understanding Scale Shift in Domain Generalization for Crowd Localization

[openreview] [pdf]

Abstract Crowd localization plays a crucial role in visual scene understanding towards predicting each pedestrian location in a crowd, thus being applicable to various downstream tasks. However, existing approaches suffer from significant performance degradation due to differences in head scale distributions (scale shift) between training and testing data, a challenge known as domain generalization (DG). This paper aims to comprehend the nature of scale shift within the context of domain generalization for crowd localization models. To this end, we address three key questions: (i) how to quantify the influence of scale shift on the DG task, (ii) why this influence occurs, and (iii) how to mitigate the influence. Specifically, we first establish a benchmark, ScaleBench, and reproduce 20 advanced DG algorithms to quantify the influence. Through extensive experiments, we demonstrate the limitations of existing algorithms and highlight the under-explored nature of this issue. To further understand the reason behind it, we provide a rigorous theoretical analysis of scale shift. Building on this analysis, we further propose a simple yet effective algorithm called Semantic Hook to mitigate the influence of scale shift on DG, which also serves as a case study revealing three significant insights for future research. Our results emphasize the importance of this novel and applicable research direction, which we term Scale Shift Domain Generalization.

183Bandits with Anytime Knapsacks

[openreview] [pdf]

Abstract We consider bandits with anytime knapsacks (BwAK), a novel version of the BwK problem where there is an anytime cost constraint instead of a total cost budget. This problem setting introduces additional complexities as it mandates adherence to the constraint throughout the decision-making process. We propose SUAK, a novel algorithm that utilizes upper confidence bounds to identify the optimal mixture of arms while maintaining a balance between exploration and exploitation. SUAK is an adaptive algorithm that strategically utilizes the available budget in each round in the decision-making process and skips a round when it is possible to violate the anytime cost constraint. In particular, SUAK slightly under-utilizes the available cost budget to reduce the need for skipping rounds. We show that SUAK attains the same problem-dependent regret upper bound of $O(K \log T)$ established in prior work under the simpler BwK framework. Finally, we provide simulations to verify the utility of SUAK in practical settings.
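A toy round in the spirit of the described mechanism, assuming known per-arm costs and a standard UCB index; the skip rule and the small safety margin mimic the abstract's description, but the exact bonus and margin are assumptions, not the paper's algorithm.

```python
import numpy as np

def suak_like_step(means, counts, t, costs, budget_left, cost_margin=0.05):
    """Illustrative round: pick the UCB-best arm, but skip the round if
    pulling it could violate the anytime budget; the `cost_margin` mimics
    the deliberate under-utilization described in the abstract."""
    ucb = means + np.sqrt(2 * np.log(max(t, 2)) / np.maximum(counts, 1))
    arm = int(np.argmax(ucb))
    if costs[arm] > budget_left - cost_margin:
        return None  # skip this round to respect the anytime constraint
    return arm
```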

184CoLa-DCE – Concept-guided Latent Diffusion Counterfactual Explanations

[openreview] [pdf]

Abstract Recent advancements in generative AI have introduced novel prospects and practical implementations. Especially diffusion models show their strength in generating diverse and, at the same time, realistic features, positioning them well for generating counterfactual explanations for computer vision models. Answering “what if” questions of what needs to change to make an image classifier change its prediction, counterfactual explanations align well with human understanding and consequently help in making model behavior more comprehensible. Current methods succeed in generating authentic counterfactuals, but lack transparency as feature changes are not directly perceivable. To address this limitation, we introduce Concept-guided Latent Diffusion Counterfactual Explanations (CoLa-DCE). CoLa-DCE generates concept-guided counterfactuals for any classifier with a high degree of control regarding concept selection and spatial conditioning. The counterfactuals comprise an increased granularity through minimal feature changes. The reference feature visualization ensures better comprehensibility, while the feature localization provides increased transparency about “what” changed “where”. We demonstrate the advantages of our approach in minimality and comprehensibility across multiple image classification models and datasets and provide insights into how our CoLa-DCE explanations help comprehend model errors like misclassification cases.

185Controlling Information Leakage in Concept Bottleneck Models with Trees

[openreview] [pdf]

Abstract As AI models grow larger, the demand for accountability and interpretability has become increasingly critical for understanding their decision-making processes. Concept Bottleneck Models (CBMs) have gained attention for enhancing interpretability by mapping inputs to intermediate concepts before making final predictions. However, CBMs often suffer from information leakage, where additional input data, not captured by the concepts, is used to improve task performance, complicating the interpretation of downstream predictions. In this paper, we introduce a novel approach for training both joint and sequential CBMs that allows us to identify and control leakage using decision trees. Our method quantifies leakage by comparing the decision paths of hard CBMs with their soft, leaky counterparts. Specifically, we show that soft leaky CBMs extend the decision paths of hard CBMs, particularly in cases where concept information is incomplete. Using this insight, we develop a technique to better inspect and manage leakage, isolating the subsets of data most affected by this. Through synthetic and real-world experiments, we demonstrate that controlling leakage in this way not only improves task accuracy but also yields more informative and transparent explanations.

186Diffusion-based Decoupled Deterministic and Uncertain Framework for Probabilistic Multivariate Time Series Forecasting

[openreview] [pdf]

Abstract Diffusion-based denoising models have demonstrated impressive performance in probabilistic forecasting for multivariate time series (MTS). Nonetheless, existing approaches often model the entire data distribution, neglecting the variability in uncertainty across different components of the time series. This paper introduces a Diffusion-based Decoupled Deterministic and Uncertain ($\mathrm{D^3U}$) framework for probabilistic MTS forecasting. The framework integrates non-probabilistic forecasting with conditional diffusion generation, enabling both accurate point predictions and probabilistic forecasting. $\mathrm{D^3U}$ utilizes a point forecasting model to non-probabilistically model high-certainty components in the time series, generating embedded representations that are conditionally injected into a diffusion model. To better model high-uncertainty components, a patch-based denoising network (PatchDN) is designed in the conditional diffusion model. Designed as a plug-and-play framework, $\mathrm{D^3U}$ can be seamlessly integrated into existing point forecasting models to provide probabilistic forecasting capabilities. It can also be applied to other conditional diffusion methods that incorporate point forecasting models. Experiments on six real-world datasets demonstrate that our method achieves over a 20% improvement in both point and probabilistic forecasting performance in MTS long-term forecasting compared to state-of-the-art (SOTA) methods. Additionally, extensive ablation studies further validate the effectiveness of the $\mathrm{D^3U}$ framework.

187On the feature learning in diffusion models

[openreview] [pdf]

Abstract The predominant success of diffusion models in generative modeling has spurred significant interest in understanding their theoretical foundations. In this work, we propose a feature learning framework aimed at analyzing and comparing the training dynamics of diffusion models with those of traditional classification models. Our theoretical analysis demonstrates that, under identical settings, neural networks trained for classification tend to prioritize learning specific patterns in the data, often focusing on easy-to-learn features. In contrast, diffusion models, due to the denoising objective, are encouraged to learn more balanced and comprehensive representations of the data. To support these theoretical insights, we conduct several experiments on both synthetic and real-world datasets, which empirically validate our findings and underscore the distinct feature learning dynamics of diffusion models compared to classification models.

188HP3O: Hybrid-Policy Proximal Policy Optimization with Best Trajectory

[openreview] [pdf]

Abstract Proximal policy optimization (PPO) is one of the most popular state-of-the-art on-policy algorithms that has become a standard baseline in modern reinforcement learning with applications in numerous fields. Though it delivers stable performance with theoretical policy improvement guarantees, high variance and high sample complexity still remain critical challenges in on-policy algorithms. To alleviate these issues, we propose Hybrid-Policy Proximal Policy Optimization (HP3O), which utilizes a trajectory replay buffer to make efficient use of trajectories generated by recent policies. Particularly, the buffer applies the “first in, first out” (FIFO) strategy so as to keep only the recent trajectories to attenuate the data distribution drift. A batch consisting of the trajectory with the best return and other randomly sampled ones from the buffer is used for updating the policy networks. This strategy helps the agent improve on its most recent best performance and in turn empirically reduces variance. We theoretically construct the policy improvement guarantees for the proposed algorithm. HP3O is validated and compared against several baseline algorithms using multiple continuous control environments. Our code is available here.
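The buffer logic is straightforward to sketch from the description: FIFO eviction keeps only recent trajectories, and each update batch pairs the best-return trajectory with random draws. Details such as the capacity and batch composition below are assumptions for illustration.

```python
import random
from collections import deque

class TrajectoryBuffer:
    """FIFO trajectory buffer sketched from HP3O's description: keep only
    trajectories from recent policies, and build update batches from the
    best-return trajectory plus random samples."""
    def __init__(self, capacity=32):
        self.buf = deque(maxlen=capacity)  # FIFO eviction attenuates distribution drift

    def add(self, trajectory, ret):
        self.buf.append((ret, trajectory))

    def sample_batch(self, n_random=4):
        """One batch: the best-return trajectory plus random draws from the buffer."""
        best = max(self.buf, key=lambda x: x[0])[1]
        others = [t for _, t in random.sample(list(self.buf), min(n_random, len(self.buf)))]
        return [best] + others
```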

189TopoDiffusionNet: A Topology-aware Diffusion Model

[openreview] [pdf]

Abstract Diffusion models excel at creating visually impressive images but often struggle to generate images with a specified topology. The Betti number, which represents the number of structures in an image, is a fundamental measure in topology. Yet, diffusion models fail to satisfy even this basic constraint. This limitation restricts their utility in applications requiring exact control, like robotics and environmental modeling. To address this, we propose TopoDiffusionNet (TDN), a novel approach that enforces diffusion models to maintain the desired topology. We leverage tools from topological data analysis, particularly persistent homology, to extract the topological structures within an image. We then design a topology-based objective function to guide the denoising process, preserving intended structures while suppressing noisy ones. Our experiments across four datasets demonstrate significant improvements in topological accuracy. TDN is the first to integrate topology with diffusion models, opening new avenues of research in this area.

190Do we need rebalancing strategies? A theoretical and empirical study around SMOTE and its variants

[openreview] [pdf]

Abstract Synthetic Minority Oversampling Technique (SMOTE) is a common rebalancing strategy for handling imbalanced tabular data sets. However, few works analyze SMOTE theoretically. In this paper, we prove that SMOTE (with default parameter) tends to copy the original minority samples asymptotically. We also prove that SMOTE exhibits boundary artifacts, thus justifying existing SMOTE variants. Then we introduce two new SMOTE-related strategies, and compare them with state-of-the-art rebalancing procedures. Surprisingly, for most data sets, we observe that applying no rebalancing strategy is competitive in terms of predictive performances, with tuned random forests, logistic regression or LightGBM. For highly imbalanced data sets, our new methods, named CV-SMOTE and Multivariate Gaussian SMOTE, are competitive. Besides, our analysis sheds some light on the behavior of common rebalancing strategies, when used in conjunction with random forests.
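For reference, vanilla SMOTE (the procedure analyzed here) interpolates each synthetic point between a minority sample and one of its k nearest minority neighbors:

```python
import numpy as np

def smote(minority, n_new, k=5, rng=np.random.default_rng(0)):
    """Vanilla SMOTE: each synthetic point lies on the segment between a
    random minority sample and one of its k nearest minority neighbors."""
    out = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        d = np.linalg.norm(minority - minority[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]   # nearest neighbors, excluding the point itself
        j = rng.choice(nbrs)
        lam = rng.random()              # uniform interpolation weight
        out.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(out)
```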

191q-exponential family for policy optimization

[openreview] [pdf]

Abstract Policy optimization methods benefit from a simple and tractable policy parametrization, usually the Gaussian for continuous action spaces. In this paper, we consider a broader policy family that remains tractable: the $q$-exponential family. This family of policies is flexible, allowing the specification of both heavy-tailed policies ($q>1$) and light-tailed policies ($q<1$). This paper examines the interplay between $q$-exponential policies and several actor-critic algorithms on both online and offline problems. We find that heavy-tailed policies are more effective in general and can consistently improve on the Gaussian. In particular, we find the Student’s t-distribution to be more stable than the Gaussian across settings, and that a heavy-tailed $q$-Gaussian for Tsallis Advantage Weighted Actor-Critic consistently performs well in offline benchmark problems.
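In practice, swapping the policy head from a Gaussian to a heavy-tailed member of this family can be as small a change as the sampling distribution; a minimal sketch with the Student's t, where the degrees of freedom `df` is a design choice and not a value prescribed by the paper:

```python
import torch
from torch.distributions import StudentT, Normal

def sample_action(mu, sigma, df=3.0, heavy_tailed=True):
    """Draw a continuous action from a heavy-tailed Student's t policy
    (a member of the q-exponential family for q > 1) instead of a Gaussian."""
    dist = StudentT(df, loc=mu, scale=sigma) if heavy_tailed else Normal(mu, sigma)
    a = dist.sample()
    return a, dist.log_prob(a)  # log-prob is what a policy-gradient update needs

# Example: one action from a heavy-tailed policy head.
action, logp = sample_action(torch.zeros(2), torch.ones(2))
```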

192Refining Counterfactual Explanations With Joint-Distribution-Informed Shapley Towards Actionable Minimality

[openreview] [pdf]

Abstract Counterfactual explanations (CE) identify data points that closely resemble the observed data but produce different machine learning (ML) model outputs, offering critical insights into model decisions. Despite the diverse scenarios, goals and tasks to which they are tailored, existing CE methods often lack actionable efficiency because of unnecessary feature changes included within the explanations that are presented to users and stakeholders. We address this problem by proposing a method that minimizes the required feature changes while maintaining the validity of CE, without imposing restrictions on models or CE algorithms, whether instance- or group-based. The key innovation lies in computing a joint distribution between observed and counterfactual data and leveraging it to inform Shapley values for feature attributions (FA). We demonstrate that optimal transport (OT) effectively derives this distribution, especially when the alignment between observed and counterfactual data is unclear in used CE methods. Additionally, a counterintuitive finding is uncovered: it may be misleading to rely on an exact alignment defined by the CE generation mechanism in conducting FA. Our proposed method is validated through extensive experiments across multiple datasets, showcasing its effectiveness in refining CE towards greater actionable efficiency.

193Searching For Robust Point Cloud Distillation

[openreview] [pdf]

Abstract Deep Neural Networks (DNNs) have shown remarkable performance in machine learning; however, their vulnerabilities to adversarial attacks have been exposed, particularly in point cloud data. Neural Architecture Search (NAS) is a technique for discovering new neural architectures with high predictive accuracy, yet its potential for enhancing model robustness against adversarial attacks remains largely unexplored. In this study, we investigate the application of NAS within the framework of knowledge distillation, aiming to generate robust student architectures that inherit resilience from robust teacher models. We introduce RDANAS, an effective NAS method that utilizes cross-layer knowledge distillation from robust teacher models to enhance the robustness of the student model. Unlike previous studies, RDANAS considers the teacher model’s outputs and automatically identifies the optimal teacher layer for each student layer during supervision. Experimental results on ModelNet40, ScanObjectNN and ScanNet datasets demonstrate the efficacy of RDANAS, revealing that the neural architectures it generates are compact and possess adversarial robustness, which shows potential in multiple applications.

194Diffusion Auto-regressive Transformer for Effective Self-supervised Time Series Forecasting

[openreview] [pdf]

Abstract Self-supervised learning has become an essential and popular approach for enhancing time series forecasting, enabling models to learn universal representations from unlabeled data. However, effectively capturing both the global sequence dependence and local detail features within time series data remains challenging. To address this, we propose a novel generative self-supervised method called TimeDART, denoting Diffusion Auto-regressive Transformer for Time series forecasting. In TimeDART, we treat time series patches as basic modeling units. On the one hand, we employ a self-attention-based Transformer encoder to model inter-patch dependencies. On the other hand, we introduce diffusion and denoising mechanisms to capture intra-patch locality features. Notably, we design a cross-attention-based denoising decoder that allows for adjustable optimization difficulty in the self-supervised task, facilitating more effective self-supervised pre-training. Extensive experiments demonstrate that TimeDART achieves state-of-the-art fine-tuning performance compared to the most advanced competitive methods in forecasting tasks. Our code is publicly available at https://anonymous.4open.science/r/TimeDART-2024.

195Concepts’ Information Bottleneck Models

[openreview] [pdf]

Abstract Concept Bottleneck Models (CBMs) offer a self-explainable AI framework by predicting targets based on human-understandable concepts, but they often fail to achieve optimal performance and interpretability due to leakage of irrelevant information into the concept activations. This paper presents an information-theoretic enhancement of CBMs through the integration of the Information Bottleneck (IB) framework, aimed at addressing their issues of concept leakage and reduced performance. Our approach reshapes the way CBMs process and utilize concepts by constraining mutual information between input data and concepts, ensuring that only the most relevant information is preserved for decision-making. This introduces a new paradigm for CBMs that not only enhances performance but also enforces a tighter connection between latent representations and human-understandable concepts, ensuring a more robust and interpretable model. Our experiments on datasets such as CUB, AwA2, and aPY demonstrate that IB-augmented CBMs improve both concept and target prediction accuracy, while also increasing intervenability. Additionally, we propose a novel metric to assess the quality of concept sets based on intervention performance. Unlike traditional task performance metrics, which may obscure the effects of concept leakage, the new metric offers a direct, interpretable evaluation of concept set goodness.

196Online Policy Selection for Inventory Problems

[openreview] [pdf]

Abstract We tackle online inventory problems where at each time period the manager makes a replenishment decision based on partial historical information in order to meet demands and minimize costs. To solve such problems, we build upon recent works in online learning and control, use insights from inventory theory and propose a new algorithm called GAPSI. This algorithm follows a new feature-enhanced base-stock policy and deals with the troublesome question of non-differentiability which occurs in inventory problems. Our method is illustrated in the context of a complex and novel inventory system involving multiple products, lost sales, perishability, warehouse-capacity constraints and lead times. Extensive numerical simulations are conducted to demonstrate the strong performance of our algorithm on real-world data.
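A base-stock policy orders up to a target level S; the sketch below assumes a linear, feature-dependent S purely for illustration, whereas GAPSI's actual feature enhancement and its handling of non-differentiability are more involved.

```python
import numpy as np

def base_stock_order(inventory_position, features, theta):
    """Feature-enhanced base-stock rule in the spirit of GAPSI: the
    order-up-to level S is a learned function of context features (here
    linear in `theta`, an assumed form), and we order the shortfall,
    never a negative amount."""
    S = float(features @ theta)            # context-dependent base-stock level
    return max(0.0, S - inventory_position)

# Example: order quantity given current stock, today's features, and weights.
print(base_stock_order(12.0, np.array([1.0, 0.3]), np.array([10.0, 20.0])))
```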

197Outward Odyssey: Improving Reward Models with Proximal Policy Exploration for Preference-Based Reinforcement Learning

[openreview] [pdf]

Abstract Reinforcement learning (RL) heavily depends on well-designed reward functions, which can be challenging to create and may introduce biases, especially for complex behaviors. Preference-based RL (PbRL) addresses this by using human feedback to construct a reward model that reflects human preferences, yet requiring considerable human involvement. To alleviate this, several PbRL methods aim to select queries that need minimal feedback. However, these methods do not directly enhance the data coverage within the preference buffer. In this paper, to emphasize the critical role of preference buffer coverage in determining the quality of the reward model, we first investigate and find that a reward model’s evaluative accuracy is highest for trajectories within the preference buffer’s distribution and decreases significantly for out-of-distribution trajectories. Motivated by this phenomenon, we introduce the Proximal Policy Exploration (PPE) algorithm, which consists of a proximal-policy extension method and a mixture distribution query method. To achieve higher preference buffer coverage, the proximal-policy extension method encourages active exploration of data within near-policy regions that fall outside the preference buffer’s distribution. To balance the inclusion of in-distribution and out-of-distribution data, the mixture distribution query method proactively selects a mix of data from both outside and within the preference buffer’s distribution for querying. PPE not only expands the preference buffer’s coverage but also ensures the reward model’s evaluative capability for in-distribution data. Our comprehensive experiments demonstrate that PPE achieves significant improvement in both human feedback efficiency and RL sample efficiency, underscoring the importance of preference buffer coverage in PbRL tasks.

198Conditional Diffusion Models are Minimax-Optimal and Manifold-Adaptive for Conditional Distribution Estimation

[openreview] [pdf]

Abstract We consider a class of conditional forward-backward diffusion models for conditional generative modeling, that is, generating new data given a covariate (or control variable). To formally study the theoretical properties of these conditional generative models, we adopt a statistical framework of distribution regression to characterize the large sample properties of the conditional distribution estimators induced by these conditional forward-backward diffusion models. Here, the conditional distribution of data is assumed to smoothly change over the covariate. In particular, our derived convergence rate is minimax-optimal under the total variation metric within the regimes covered by the existing literature. Additionally, we extend our theory by allowing both the data and the covariate variable to potentially admit a low-dimensional manifold structure. In this scenario, we demonstrate that the conditional forward-backward diffusion model can adapt to both manifold structures, meaning that the derived estimation error bound (under the Wasserstein metric) depends only on the intrinsic dimensionalities of the data and the covariate.

199D3PM: Diffusion Model Responds to the Duty Call from Causal Discovery

[openreview] [pdf]

Abstract Causal discovery (CD) involves inferring cause-and-effect relationships as directed acyclic graphs (DAGs). In this work, we assume that the data is generated by an additive noise model (ANM). Recent work has formulated the problem as a continuous optimization problem, which consists of solving an inverse problem and satisfying an acyclicity constraint. However, solving the inverse problem in CD is often unstable, i.e. high sensitivity of the effects to perturbations in the causes. To address this instability, we formulate the inverse problem as a regularized optimization scheme and propose a novel variation-negotiation regularizer. Compared to traditional regularization techniques for the continuous optimization problem, e.g. $\ell_1$ penalty on graphs, the proposed regularizer exploits the variation variable in ANMs to stabilize the solutions (i.e. DAGs). This regularizer is advantageous as it does not rely on any hypotheses, such as graph sparsity, about true DAGs. The variation-negotiation regularizer regulates the DAG purely based on observed data. Building on the proposed regularizer, a series of improvements to the regularized optimization scheme reveal the connections between solving the regularized optimization problem and learning a diffusion model, as they share comparable objective functions. This insight leads us to develop an equivalent diffusion model called DAG-invariant Denoising Diffusion Probabilistic Model. Extensive empirical experiments on synthetic and real datasets demonstrate that the proposed diffusion model achieves outstanding performance on all datasets.

200Domain Shift Tuning over Knowledge Gap

[openreview] [pdf]

Abstract This paper introduces Domain Shift Tuning (DST), a novel framework designed to guide pre-trained language models (PLMs), including Large Language Models (LLMs), in overcoming domain discrepancies (i.e., source-target). PLMs, pre-trained on extensive and diverse corpora (the source domain), often encounter domain gaps after fine-tuning on the target domain. Unlike conventional adapters or Parameter-Efficient Fine-Tuning (PEFT) methods, DST conceptualizes domain gaps as differences in knowledge encapsulated within multiple subnetworks of PLMs. To bridge this gap, our challenge is to find a subnetwork set that corresponds to these pieces of knowledge and their weights. This direction leads DST to employ a lightweight subnetwork, the Knowledge Steering Layer (KSL), and a training objective, Knowledge Distribution Modeling (KDM). These components enable DST to fine-tune PLMs by aligning the knowledge weights of the source domain with those of the target domain. Experimental results on diverse datasets demonstrate that DST effectively mitigates the domain gap, allowing PLMs to generate text that closely aligns with even a small target corpus, thereby significantly enhancing domain adaptation for PLMs at lower computational cost.

201Deployment Efficient Reward-Free Exploration with Linear Function Approximation

[openreview] [pdf]

Abstract We study deployment-efficient reward-free exploration with linear function approximation, where the goal is to explore a linear Markov Decision Process (MDP) without revealing the reward function, while minimizing the number of exploration policies used during the algorithm. We design a new reinforcement learning (RL) algorithm whose sample complexity is polynomial in the feature dimension and horizon length, while achieving nearly optimal deployment efficiency for linear MDPs under the reward-free exploration setting. More specifically, our algorithm explores a linear MDP in a reward-free manner, while using at most $H$ exploration policies during its execution, where $H$ is the horizon length. Compared to previous algorithms with similar deployment efficiency guarantees, the sample complexity of our algorithm does not depend on the reachability coefficient or the explorability coefficient of the underlying MDP, which can be arbitrarily small for certain MDPs. Our result addresses an open problem proposed in prior work. To achieve such a result, we show how to truncate state-action pairs of the underlying linear MDP in a data-dependent manner, and devise efficient offline policy evaluation and offline policy optimization algorithms in the truncated linear MDP. We further show how to implement reward-free exploration mechanisms in the linear function approximation setting by carefully combining these offline RL algorithms without sacrificing the deployment efficiency.

202Channel-aware Contrastive Conditional Diffusion for Multivariate Probabilistic Time Series Forecasting

[openreview] [pdf]

Abstract Forecasting faithful trajectories of multivariate time series from practical scopes is essential for reasonable decision-making. Recent methods majorly tailor generative conditional diffusion models to estimate the target temporal predictive distribution. However, efficiently exploiting the implicit temporal predictive information to bolster conditional diffusion learning remains an obstacle. To this end, we propose a generic channel-aware contrastive conditional diffusion model termed CCDM to achieve desirable multivariate probabilistic forecasting, obviating the need for curated temporal conditioning inductive biases. In detail, we first design a channel-centric conditional denoising network to manage intra-variate variations and cross-variate correlations, which can lead to scalability on diverse prediction horizons and channel numbers. Then, we devise an ad-hoc denoising-based temporal contrastive learning to explicitly amplify the predictive mutual information between past observations and future forecasts. It can coherently complement naive step-wise denoising diffusion training and improve the forecasting accuracy and generality on unknown test time series. Besides, we offer theoretic insights on the benefits of such auxiliary contrastive training refinement from both neural mutual information and temporal distribution generalization aspects. The proposed CCDM exhibits superior forecasting capability compared to current state-of-the-art diffusion forecasters over a comprehensive benchmark, achieving the best MSE and CRPS outcomes in 66.67% and 83.33% of cases, respectively.

203Dataset Distillation for Domain Generalization

[openreview] [pdf]

Abstract Dataset Distillation (DD) has been applied to various downstream tasks and recently scaled to ImageNet-1k, highlighting its potential for practical applications. However, in real-world scenarios, robustness to unseen domains is essential, and the robustness of models trained on synthetic datasets remains uncertain. To address this, we propose a novel task, Dataset Distillation for Domain Generalization (DD for DG), and evaluate the unseen domain generalization of models trained on synthetic datasets distilled by state-of-the-art DD methods using the DomainBed benchmark. Additionally, we introduce a new method for this task, which interprets DD through the lens of image style transfer, achieving superior performance in unseen domain generalization compared to baseline approaches.

204Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback

[openreview] [pdf]

Abstract Learning from human feedback has enabled the alignment of language models (LMs) with human preferences. However, directly collecting human preferences can be expensive, time-consuming, and can have high variance. An appealing alternative is to distill preferences from LMs as a source of synthetic annotations as they are more consistent, cheaper, and scale better than human annotation; however, they are also prone to biases and errors. In this work, we introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality, while reducing the total cost of human annotation. The crux of our approach is to identify preference instances that will benefit from human annotations. We formulate this as an optimization problem: given a preference dataset and an evaluation metric, we train a performance prediction model to predict a reward model’s performance on an arbitrary combination of human and LM annotations and employ a routing strategy that selects a combination that maximizes predicted performance. We train the performance prediction model on MultiPref, a new preference dataset with 10K instances paired with human and LM labels. We show that the selected hybrid mixture of LM and direct human preferences using our routing framework achieves better reward model performance compared to using either one exclusively. We simulate selective human preference collection on three other datasets and show that our method generalizes well to all three. We analyze features from the routing model to identify characteristics of instances that can benefit from human feedback, e.g., prompts with a moderate safety concern or moderate intent complexity. We release the dataset, annotation platform, and source code used in this study to foster more efficient and accurate preference collection in the future.
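
A hedged sketch of the routing step as described: a performance-prediction model scores candidate human/LM annotation mixes, and the mix with the best predicted reward-model performance is selected. The features, the linear predictor, and the greedy search are toy stand-ins, not the paper's models.

```python
import numpy as np

rng = np.random.default_rng(0)
n_instances, budget = 200, 50
features = rng.standard_normal((n_instances, 3))    # per-instance features

w = np.array([0.5, -0.2, 0.8])                      # hypothetical predictor weights
def predicted_performance(human_mask):
    # Predicted reward-model performance if `human_mask` instances get human
    # labels and the rest keep LM labels (toy linear model over mean features).
    return float(features[human_mask].mean(axis=0) @ w) if human_mask.any() else 0.0

# Greedy routing: send to humans the instances the predictor values most.
gain = features @ w
human = np.zeros(n_instances, dtype=bool)
human[np.argsort(-gain)[:budget]] = True
print(f"routed {human.sum()} to humans, predicted perf {predicted_performance(human):.3f}")
```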

205Counterfactual Delayed Feedback Learning

[openreview] [pdf]

Abstract Estimation of heterogeneous treatment effects has gathered much attention in recent years and has been widely adopted in medicine, economics, and marketing. Previous studies assumed that one of the potential outcomes of interest could be observed timely and accurately. However, a more practical scenario is that treatment takes time to produce causal effects on the outcomes. For example, drugs take time to produce medical utility for patients, and users take time to purchase items after being recommended; ignoring such delays in feedback can lead to biased estimates of heterogeneous treatment effects. To address this problem, we study the impact of observation time on estimating heterogeneous treatment effects by explicitly modeling the potential response time of potential outcomes. We theoretically prove identifiability results and propose a principled learning approach, known as CFR-DF (Counterfactual Regression with Delayed Feedback), to simultaneously learn potential response times and potential outcomes of interest. Results on both simulated and real-world datasets demonstrate the effectiveness of our method.

206Influence Functions for Scalable Data Attribution in Diffusion Models

[openreview] [pdf]

Abstract Diffusion models have led to significant advancements in generative modelling. Yet their widespread adoption poses challenges regarding data attribution and interpretability. In this paper, we aim to help address such challenges in diffusion models by extending influence functions. Influence function-based data attribution methods approximate how a model’s output would have changed if some training data were removed. In supervised learning, this is usually used for predicting how the loss on a particular example would change. For diffusion models, we focus on predicting the change in the probability of generating a particular example via several proxy measurements. We show how to formulate influence functions for such quantities and how previously proposed methods can be interpreted as particular design choices in our framework. To ensure scalability of the Hessian computations in influence functions, we use a K-FAC approximation based on generalised Gauss-Newton matrices specifically tailored to diffusion models. We show that our recommended method outperforms previously proposed data attribution methods on common data attribution evaluations, such as the Linear Data-modelling Score (LDS) or retraining without top influences, without the need for method-specific hyperparameter tuning.
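
A toy sketch of the general influence-function recipe this line of work builds on (not the paper's diffusion-specific proxy measurements or its K-FAC approximation): the effect of removing a training example on a measurement $m(\theta)$ is approximated via the inverse Hessian of the training loss. Here the Hessian is exact because the model is a tiny linear regressor; K-FAC is the scalable stand-in the paper uses.

```python
import torch

torch.manual_seed(0)
X = torch.randn(32, 4)
w_true = torch.randn(4)
y = X @ w_true + 0.1 * torch.randn(32)

w = torch.zeros(4, requires_grad=True)
opt = torch.optim.LBFGS([w], max_iter=100)

def loss_fn():
    opt.zero_grad()
    loss = ((X @ w - y) ** 2).mean()
    loss.backward()
    return loss

opt.step(loss_fn)

# Measurement m(w): loss on a held-out "query" example. For diffusion models
# the paper instead uses proxy measurements of generation probability.
x_q, y_q = torch.randn(4), torch.tensor(0.5)
m = (x_q @ w - y_q) ** 2
grad_m = torch.autograd.grad(m, w)[0]

# Damped Hessian of the training loss (K-FAC would approximate this blockwise).
H = 2.0 / len(X) * X.T @ X + 1e-3 * torch.eye(4)
H_inv_grad_m = torch.linalg.solve(H, grad_m)

# Influence of a training example: its loss gradient dotted with H^{-1} grad m.
for i in range(3):
    g_i = 2 * (X[i] @ w - y[i]) * X[i]
    print(f"influence of example {i}: {torch.dot(g_i.detach(), H_inv_grad_m):+.4f}")
```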

207AVID: Adapting Video Diffusion Models to World Models

[openreview] [pdf]

Abstract Large-scale generative models have achieved remarkable success in a number of domains. However, for sequential decision-making problems, such as robotics, action-labelled data is often scarce and therefore scaling-up foundation models for decision-making remains a challenge. A potential solution lies in leveraging widely-available unlabelled videos to train world models that simulate the consequences of actions. If the world model is accurate, it can be used to optimize decision-making in downstream tasks. Image-to-video diffusion models are already capable of generating highly realistic synthetic videos. However, these models are not action-conditioned, and the most powerful models are closed source which means they cannot be finetuned. In this work, we propose to adapt pretrained video diffusion models to action-conditioned world models, without access to the parameters of the pretrained model. Our approach, AVID, trains an adapter on a small domain-specific dataset of action-labelled videos. AVID uses a learnt mask to modify the intermediate outputs of the pretrained model and generate accurate action-conditioned videos. We evaluate AVID on video game and real-world robotics data, and show that it outperforms existing baselines for diffusion model adaptation. Our results demonstrate that if utilized correctly, pretrained video models have the potential to be powerful tools for embodied AI.
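
A hedged sketch of the AVID-style combination: a small action-conditioned adapter and a learned mask blend with the frozen pretrained model's intermediate output. All module names, shapes, and the conditioning scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskedAdapter(nn.Module):
    def __init__(self, channels: int, action_dim: int):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, channels)
        self.adapter = nn.Conv2d(channels, channels, 3, padding=1)
        self.mask_head = nn.Conv2d(channels, 1, 1)  # per-pixel blend weight

    def forward(self, pretrained_out, noisy_latent, action):
        # Condition the adapter on the action by adding a projected embedding.
        a = self.action_proj(action)[:, :, None, None]
        adapter_out = self.adapter(noisy_latent + a)
        mask = torch.sigmoid(self.mask_head(noisy_latent))
        # The learned mask decides, per location, whether to trust the frozen
        # pretrained prediction or the action-conditioned correction.
        return mask * adapter_out + (1 - mask) * pretrained_out

adapter = MaskedAdapter(channels=8, action_dim=4)
pretrained_out = torch.randn(2, 8, 16, 16)   # frozen model output (no grads)
noisy_latent = torch.randn(2, 8, 16, 16)
action = torch.randn(2, 4)
print(adapter(pretrained_out, noisy_latent, action).shape)  # (2, 8, 16, 16)
```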

208Shift-Resilient Diffusive Imputation for Variable Subset Forecasting

[openreview] [pdf]

Abstract It is common for sensor failures to result in missing data, leaving training sets complete while test sets contain only a small subset of variables. The challenge lies in forecasting from such incomplete data, a setting known as Variable Subset Forecasting (VSF). VSF tasks exhibit significant distribution shift of two types: inter-series shift, which indicates changes in correlations between different series, and intra-series shift, which refers to substantial distribution differences within the same series across different time windows. Existing approaches to VSF typically impute the missing data first and then make predictions using the completed series. However, these methods do not account for the shift inherent in VSF tasks, resulting in poor model performance. To address these challenges, we propose a Shift-Resilient Diffusive Imputation (SRDI) framework. Specifically, SRDI integrates a divide-and-conquer strategy with the denoising process, decomposing the input into invariant patterns and variant patterns that represent the temporally stable and the highly fluctuating parts of inter-series correlation, respectively. By extracting spatiotemporal features from each separately and then combining them appropriately, inter-series shift can be effectively mitigated. Then, we organize SRDI and the forecasting model into a meta-learning paradigm tailored for VSF scenarios, addressing intra-series shift by treating time windows as tasks during training and employing an adaptation process before testing. Extensive experiments on four datasets demonstrate our superior performance compared with state-of-the-art methods.

209Learning-Guided Rolling Horizon Optimization for Long-Horizon Flexible Job-Shop Scheduling

[openreview] [pdf]

Abstract Long-horizon combinatorial optimization problems, such as the Flexible Job-Shop Scheduling Problem (FJSP), often involve complex, interdependent decisions over extended time frames, posing significant challenges for existing solvers. While Rolling Horizon Optimization (RHO) addresses this by decomposing problems into overlapping shorter-horizon subproblems, such overlap often leads to redundant computations. In this paper, we present L-RHO, the first learning-guided RHO framework for long-horizon FJSP. L-RHO employs a customized attention-based model to intelligently fix variables that in hindsight did not need to be re-optimized, resulting in smaller and thus easier-to-solve subproblems. For FJSP, this means identifying operations with unchanged machine assignments between two consecutive subproblems. Empirically, L-RHO accelerates RHO by up to 54% while showing significantly improved solution quality, enabling it to outperform other heuristic and learning-based baselines. We also provide in-depth discussions and verify the desirable adaptability and generalization of L-RHO across various FJSP settings, distributions, and online scenarios. Moreover, we provide a theoretical analysis to elucidate the conditions under which learning is beneficial.

210Make Interval Bound Propagation great again

[openreview] [pdf]

Abstract In various scenarios motivated by real life, such as medical data analysis, autonomous driving, and adversarial training, we are interested in robust deep networks. A network is robust when a relatively small perturbation of the input cannot lead to drastic changes in the output (such as a change of class). This falls under the broader field of Neural Network Certification (NNC). Two crucial problems in NNC are of profound interest to the scientific community: how to calculate the robustness of a given pre-trained network, and how to construct robust networks. The common approach to constructing robust networks is Interval Bound Propagation (IBP). This paper demonstrates that IBP is sub-optimal in the first case due to its susceptibility to the wrapping effect. Even for linear activations, IBP gives strongly sub-optimal bounds. Consequently, one should use strategies immune to the wrapping effect to obtain bounds close to optimal. We adapt two classical approaches dedicated to strict computations -- Doubleton Arithmetic and Affine Arithmetic -- to mitigate the wrapping effect in neural networks. These techniques yield precise results for networks with linear activation functions and thus resist the wrapping effect. As a result, we achieve bounds significantly closer to optimal than those of IBP.
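
A minimal numerical sketch of the wrapping effect the abstract analyzes: propagating an input box through two linear maps layer by layer (as IBP does) overestimates the exact reachable set obtained by propagating through the composed map, which is what affine/doubleton-style arithmetic recovers for linear activations. The matrices and radii are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
center, radius = np.zeros(3), np.ones(3) * 0.1  # input box [c - r, c + r]

# IBP: propagate center and radius through each layer separately.
c1, r1 = W1 @ center, np.abs(W1) @ radius
c2_ibp, r2_ibp = W2 @ c1, np.abs(W2) @ r1

# Exact bounds for the linear composition: propagate through W2 @ W1 at once.
W = W2 @ W1
c2_exact, r2_exact = W @ center, np.abs(W) @ radius

print("IBP radius:  ", r2_ibp)
print("exact radius:", r2_exact)  # elementwise no larger than IBP's
```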

211Minimax Optimal Regret Bound for Reinforcement Learning with Trajectory Feedback

[openreview] [pdf]

Abstract We study the reinforcement learning (RL) problem with trajectory feedback, where the learner can only observe the cumulative noisy reward along the trajectory. This setting is particularly suitable for practical scenarios where the agent suffers extensively from querying the reward at each single step. For a finite-horizon Markov Decision Process (MDP) with $S$ states, $A$ actions and a horizon length of $H$, we develop an algorithm that enjoys an optimal regret of $\tilde{O}(\sqrt{SAH^3K})$ over $K$ episodes for sufficiently large $K$. To achieve this, our technical contributions are two-fold: (1) we incorporate reinforcement learning with the linear bandits problem to construct a tighter confidence region for the reward function; (2) we construct a reference transition model to better guide the exploration process.

212AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials

[openreview] [pdf]

Abstract Graphical User Interface (GUI) agents hold great potential for automating complex tasks across diverse digital environments, from web applications to desktop software. However, the development of such agents is hindered by the lack of high-quality, multi-step trajectory data required for effective training. Existing approaches rely on expensive and labor-intensive human annotation, making them unsustainable at scale. To address this challenge, we propose AgentTrek, a scalable data synthesis pipeline that generates high-quality GUI agent trajectories by leveraging web tutorials. Our method automatically gathers tutorial-like texts from the internet, transforms them into task goals with step-by-step instructions, and employs a visual-language model (VLM) agent to simulate their execution in a real digital environment. A VLM-based evaluator ensures the correctness of the generated trajectories. We demonstrate that training GUI agents with these synthesized trajectories significantly improves their grounding and planning performance over the current models. Moreover, our approach is more cost-efficient compared to traditional human annotation methods. This work underscores the potential of guided replay with web tutorials as a viable strategy for large-scale GUI agent training, paving the way for more capable and autonomous digital agents.

213Incorporating continuous dependence implies better generalization ability

[openreview] [pdf]

Abstract When applying deep-learning-based solvers to differential equations, a key challenge is how to improve their generalization ability so that pre-trained models can easily be adapted to new scenarios of interest. In this paper, inspired by the well-known mathematical statements on the continuous dependence of solutions to ordinary differential equations on initial values and parameters, we make a non-trivial extension of physics-informed neural networks by incorporating additional information on the continuous dependence of solutions (abbreviated as cd-PINN). Our cd-PINN integrates the advantages of neural operators and Meta-PINN, requiring only a few labeled data points while enabling fast and accurate solution of ordinary differential equations under new initial values and parameters without fine-tuning. As demonstrated on examples such as the Logistic model, the Lotka-Volterra model, and damped harmonic oscillators, the accuracy of cd-PINN under those untrained conditions is usually 1-3 orders of magnitude higher than that of PINN, while the GPU training time of the two approaches is comparable. We therefore expect cd-PINN to be particularly useful in improving the efficiency and accuracy of deep-learning-based solvers for differential equations.
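
A hedged sketch of the core idea as we read it: the network takes the initial value and the ODE parameter as extra inputs, so one model covers a whole family of logistic equations du/dt = r u (1 - u). The sampling ranges, losses, and the omission of any extra continuous-dependence regularizer are illustrative assumptions.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 64),
                    nn.Tanh(), nn.Linear(64, 1))

def residual_loss(n=256):
    t = torch.rand(n, 1, requires_grad=True)          # times in [0, 1]
    u0 = torch.rand(n, 1) * 0.8 + 0.1                 # sampled initial values
    r = torch.rand(n, 1) * 2.0 + 0.5                  # sampled growth rates
    u = net(torch.cat([t, u0, r], dim=1))
    du_dt = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    ode = (du_dt - r * u * (1 - u)).pow(2).mean()     # physics residual
    ic = (net(torch.cat([torch.zeros_like(t), u0, r], dim=1)) - u0).pow(2).mean()
    return ode + ic

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(200):
    opt.zero_grad()
    loss = residual_loss()
    loss.backward()
    opt.step()
print(f"final residual loss: {loss.item():.4f}")
```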

214Uncertainty-aware Guided Diffusion for Missing Data in Sequential Recommendation

[openreview] [pdf]

Abstract Denoising diffusion models (DDMs) have shown significant potential in generating oracle items that best match user preference with guidance from user historical interaction sequences. However, the quality of guidance is often compromised by unpredictable missing data in the observed sequence, leading to suboptimal item generation. To tackle this challenge, we propose a novel uncertainty-aware guided diffusion model (DreamMiss) to alleviate the influence of missing data. The core of DreamMiss is the utilization of a dual-side Thompson sampling (DTS) strategy, which simulates the stochastic mechanism of missing data without disrupting preference evolution. Specifically, we first define dual-side probability models to capture user preference evolution, taking into account both local item continuity and global sequence stability. We then strategically remove items based on these two models with DTS, creating uncertainty-aware guidance for DDMs to generate oracle items. This achieves consistency regularization for DDMs, enabling them to remain resilient against uncertain missing data. Additionally, to accelerate sampling in the reverse process, DreamMiss is implemented under the framework of denoising diffusion implicit models (DDIM). Extensive experimental results show that DreamMiss significantly outperforms baselines in sequential recommendation.

215DyDiff: Long-Horizon Rollout via Dynamics Diffusion for Offline Reinforcement Learning

[openreview] [pdf]

Abstract With the great success of diffusion models (DMs) in generating realistic synthetic vision data, many researchers have investigated their potential in decision-making and control. Most of these works utilized DMs to sample directly from the trajectory space, where DMs can be viewed as a combination of dynamics models and policies. In this work, we explore how to decouple DMs’ ability as dynamics models in fully offline settings, allowing the learning policy to roll out trajectories. As DMs learn the data distribution from the dataset, their intrinsic policy is actually the behavior policy induced from the dataset, which results in a mismatch between the behavior policy and the learning policy. We propose Dynamics Diffusion, short as DyDiff, which can inject information from the learning policy to DMs iteratively. DyDiff ensures long-horizon rollout accuracy while maintaining policy consistency and can be easily deployed on model-free algorithms. We provide theoretical analysis to show the advantage of DMs on long-horizon rollout over models and demonstrate the effectiveness of DyDiff in the context of offline reinforcement learning, where the rollout dataset is provided but no online environment for interaction. Our code is at https://anonymous.4open.science/r/DyDiff.

216The Convergence of Second-Order Sampling Methods for Diffusion Models

[openreview] [pdf]

Abstract Diffusion models have achieved great success in generating samples from complex distributions, notably in the domains of images and videos. Beyond the experimental success, theoretical insights into their performance have been illuminated, particularly concerning the convergence of diffusion models when applied with discretization methods such as Euler-Maruyama (EM) and Exponential Integrator (EI). This paper embarks on analyzing the convergence of the higher-order discretization method (SDE-DPM-2) under an $L^2$-accurate score estimate. Our findings reveal that to attain $\tilde{O}(\epsilon_0^2)$ Kullback-Leibler (KL) divergence between the target and the sampled distributions, the sampling complexity -- or the required number of discretization steps -- for SDE-DPM-2 is $\tilde{O}(1/\epsilon_0)$, which is better than the currently known sample complexity of EI given by $\tilde{O}(1/\epsilon_0^2)$. We further extend our analysis to the Runge-Kutta-2 (RK-2) method, which demands a sampling complexity of $\tilde{O}(1/\epsilon_0^2)$, indicating that SDE-DPM-2 is more efficient than RK-2. Our study also demonstrates that the convergence of SDE-DPM-2 under Variance Exploding (VE) SDEs aligns with that of Variance Preserving (VP) SDEs, highlighting the adaptability of SDE-DPM-2 across various diffusion model frameworks.

217An Efficient Framework for Crediting Data Contributors of Diffusion Models

[openreview] [pdf]

Abstract As diffusion models are deployed in real-world settings and their performance is driven by training data, appraising the contribution of data contributors is crucial to creating incentives for sharing quality data and to implementing policies for data compensation. Depending on the use case, model performance corresponds to various global properties of the distribution learned by a diffusion model (e.g., overall aesthetic quality). Hence, here we address the problem of attributing global properties of diffusion models to data contributors. The Shapley value provides a principled approach to valuation by uniquely satisfying game-theoretic axioms of fairness. However, estimating Shapley values for diffusion models is computationally impractical because it requires retraining and rerunning inference on many subsets of data contributors. We introduce a method to efficiently retrain and rerun inference for Shapley value estimation, by leveraging model pruning and fine-tuning. We evaluate the utility of our method with three use cases: (i) image quality for a DDPM trained on a CIFAR dataset, (ii) demographic diversity for an LDM trained on CelebA-HQ, and (iii) aesthetic quality for a Stable Diffusion model LoRA-finetuned on Post-Impressionist artworks. Our results empirically demonstrate that our framework can identify important data contributors across global properties, outperforming existing attribution methods for diffusion models.
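
A minimal Monte Carlo sketch of Shapley value estimation over data contributors. The paper's contribution is making the utility evaluation (retraining plus inference) cheap via pruning and fine-tuning; here a made-up scalar utility over contributor subsets stands in for that evaluation.

```python
import random

contributors = ["A", "B", "C", "D"]

def utility(subset):
    # Hypothetical stand-in for "global property of the retrained model",
    # e.g. aesthetic quality of generations from a model trained on `subset`.
    base = {"A": 3.0, "B": 1.0, "C": 2.0, "D": 0.5}
    return sum(base[c] for c in subset) ** 0.5 if subset else 0.0

def shapley_mc(n_perms=2000, seed=0):
    rng = random.Random(seed)
    values = {c: 0.0 for c in contributors}
    for _ in range(n_perms):
        perm = contributors[:]
        rng.shuffle(perm)
        prefix, prev = [], 0.0
        for c in perm:
            prefix.append(c)
            cur = utility(prefix)
            values[c] += (cur - prev) / n_perms  # average marginal contribution
            prev = cur
    return values

print(shapley_mc())  # contributor "A" should receive the largest share
```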

218Policy Gradient with Tree Expansion

[openreview] [pdf]

Abstract Policy gradient methods are notorious for having a large variance and high sample complexity. To mitigate this, we introduce SoftTreeMax---a generalization of softmax that employs planning. In SoftTreeMax, we extend the traditional logits with the multi-step discounted cumulative reward, topped with the logits of future states. We analyze SoftTreeMax and explain how tree expansion helps to reduce its gradient variance. We prove that the variance depends on the chosen tree-expansion policy. Specifically, we show that the closer the induced transitions are to being state-independent, the stronger the variance decay. With approximate forward models, we prove that the resulting gradient bias diminishes with the approximation error while retaining the same variance reduction. Ours is the first result to bound the gradient bias for an approximate model. In a practical implementation of SoftTreeMax we utilize a parallel GPU-based simulator for fast and efficient tree expansion. Using this implementation in Atari, we show that SoftTreeMax reduces the gradient variance by three orders of magnitude. This leads to better sample complexity and improved performance compared to distributed PPO.
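
A toy numpy sketch of the SoftTreeMax idea as described: each root action's logit is the multi-step discounted reward accumulated along a depth-d expansion, topped with the logits of the reached states. The known transition model, the greedy in-tree expansion policy, and the state-logit parameterization are illustrative assumptions (the paper shows the expansion policy choice governs the variance decay).

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, depth = 5, 3, 0.99, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] -> next-state dist
R = rng.standard_normal((n_states, n_actions))                    # r(s, a)
theta = rng.standard_normal(n_states)                             # state logits

def soft_tree_max(s):
    scores = np.zeros(n_actions)
    a_greedy = R.argmax(axis=1)                  # expansion policy inside the tree
    for a in range(n_actions):
        dist = P[s, a]                           # state distribution after (s, a)
        total = R[s, a]
        for k in range(1, depth):
            total += gamma ** k * dist @ R.max(axis=1)       # expected in-tree reward
            dist = dist @ P[np.arange(n_states), a_greedy]   # push distribution forward
        scores[a] = total + gamma ** depth * dist @ theta    # top with leaf logits
    e = np.exp(scores - scores.max())
    return e / e.sum()

print(soft_tree_max(s=0))  # tree-expanded policy over root actions
```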

219DLPO: Diffusion Model Loss-Guided Reinforcement Learning for Fine-Tuning Text-to-Speech Diffusion Models

[openreview] [pdf]

Abstract Recent advancements in generative models have sparked significant interest within the machine learning community. In particular, diffusion models have demonstrated remarkable capabilities in synthesizing images and speech. Studies such as those by Lee et al. (2023), Black et al. (2023), Wang et al. (2023), and Fan et al. (2024) illustrate that Reinforcement Learning with Human Feedback (RLHF) can enhance diffusion models for image synthesis. However, due to architectural differences between these models and those employed in speech synthesis, it remains uncertain whether RLHF could similarly benefit speech synthesis models. In this paper, we explore the practical application of RLHF to diffusion-based text-to-speech synthesis, leveraging the mean opinion score (MOS) as predicted by the UTokyo-SaruLab MOS prediction system (Saeki et al., 2022) as a proxy loss. We introduce diffusion model loss-guided RL policy optimization (DLPO) and compare it against other RLHF approaches, employing the NISQA speech quality and naturalness assessment model (Mittag et al., 2021) and human preference experiments for further evaluation. Our results show that RLHF can enhance diffusion-based text-to-speech synthesis models, and, moreover, DLPO can better improve diffusion models in generating natural and high-quality speech audio.

220Learning Loss Landscapes in Preference Optimization

[openreview] [pdf]

Abstract We present an empirical study investigating how specific properties of preference datasets, such as mixed-quality or noisy data, affect the performance of Preference Optimization (PO) algorithms. Our experiments, conducted in MuJoCo environments, reveal several scenarios where state-of-the-art PO methods experience significant drops in performance. To address this issue, we introduce a novel PO framework based on mirror descent, which can recover existing methods like Direct Preference Optimization (DPO) and Odds-Ratio Preference Optimization (ORPO) for specific choices of the mirror map. Within this framework, we employ evolutionary strategies to discover new loss functions capable of handling the identified problematic scenarios. These new loss functions lead to significant performance improvements over DPO and ORPO across several tasks. Additionally, we demonstrate the generalization capability of our approach by applying the discovered loss functions to fine-tuning large language models using mixed-quality data, where they outperform ORPO.

221Time Can Invalidate Algorithmic Recourse

[openreview] [pdf]

Abstract Algorithmic Recourse (AR) aims to provide users with actionable steps to overturn unfavourable decisions made by machine learning predictors. However, these actions often take time to implement (e.g., getting a degree can take years), and their effects may vary as the world evolves. Thus, it is natural to ask for recourse that remains valid in a dynamic environment. In this paper, we study the robustness of algorithmic recourse over time by casting the problem through the lens of causality. We demonstrate theoretically and empirically that (even robust) causal AR methods can fail over time except in the -- unlikely -- case that the world is stationary. Even more critically, unless the world is fully deterministic, counterfactual AR cannot be solved optimally. To account for this, we propose a simple yet effective algorithm for temporal AR that explicitly accounts for time. Our simulations on synthetic and realistic datasets show how considering time produces more resilient solutions to potential trends in the data distribution.

222Rethinking and Defending Protective Perturbation in Personalized Diffusion Models

[openreview] [pdf]

Abstract Personalized diffusion models (PDMs) have become prominent for adapting pretrained text-to-image models to generate images of specific subjects using minimal training data. However, PDMs are susceptible to minor adversarial perturbations, leading to significant degradation when fine-tuned on corrupted datasets. These vulnerabilities are exploited to create protective perturbations that prevent unauthorized image generation. Existing purification methods attempt to mitigate this issue but often over-purify images, resulting in information loss. In this work, we conduct an in-depth analysis of the fine-tuning process of PDMs through the lens of shortcut learning. We hypothesize and empirically demonstrate that adversarial perturbations induce a latent-space misalignment between images and their text prompts in the CLIP embedding space. This misalignment causes the model to erroneously associate noisy patterns with unique identifiers during fine-tuning, resulting in poor generalization. Based on these insights, we propose a systematic defense framework that includes data purification and contrastive decoupling learning. We first employ off-the-shelf image restoration techniques to realign images with their original semantic meanings in latent space. Then, we introduce contrastive decoupling learning with noise tokens to decouple the learning of personalized concepts from spurious noise patterns. Our study not only uncovers fundamental shortcut learning vulnerabilities in PDMs but also provides a comprehensive evaluation framework for developing stronger protection. Our extensive evaluation demonstrates its superiority over existing purification methods and stronger robustness against adaptive perturbation.

223Diverse Policies Recovering via Pointwise Mutual Information Weighted Imitation Learning

[openreview] [pdf]

Abstract Recovering a spectrum of diverse policies from a set of expert trajectories is an important research topic in imitation learning. After determining a latent style for a trajectory, previous methods for recovering diverse policies usually employ a vanilla behavioral cloning objective conditioned on the latent style, treating each state-action pair in the trajectory with equal importance. Based on the observation that in many scenarios behavioral styles are often highly relevant to only a subset of state-action pairs, this paper presents a new principled method for recovering diverse policies. In particular, after inferring or assigning a latent style for a trajectory, we enhance vanilla behavioral cloning with a weighting mechanism based on pointwise mutual information. This weighting reflects the significance of each state-action pair’s contribution to learning the style, allowing our method to focus on the state-action pairs most representative of that style. We provide theoretical justifications for our new objective, and extensive empirical evaluations confirm the effectiveness of our method in recovering diverse policies from expert data.
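
A minimal sketch of pointwise-mutual-information weighting for style-conditioned behavioral cloning: each (s, a) pair is weighted by PMI(a; z | s) = log p(a | s, z) - log p(a | s), so pairs indicative of the latent style z dominate the loss. The clamping of negative PMI and the availability of both density models are our assumptions.

```python
import torch
import torch.nn.functional as F

def pmi_weighted_bc_loss(policy_logits, actions, logp_a_given_sz, logp_a_given_s):
    """policy_logits: (B, A) logits of the style-conditioned policy.
    actions: (B,) expert actions; the two logp tensors are (B,) scores of the
    taken action under p(a|s,z) and the style-marginal p(a|s)."""
    pmi = logp_a_given_sz - logp_a_given_s          # pointwise mutual information
    weights = torch.clamp(pmi, min=0.0)             # keep only style-relevant pairs
    nll = F.cross_entropy(policy_logits, actions, reduction="none")
    return (weights * nll).mean()

# Toy usage with random tensors standing in for model outputs.
B, A = 8, 4
loss = pmi_weighted_bc_loss(torch.randn(B, A), torch.randint(0, A, (B,)),
                            torch.randn(B), torch.randn(B))
print(loss.item())
```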

224Diffusion Models Learn Low-Dimensional Distributions via Subspace Clustering

[openreview] [pdf]

Abstract Recent empirical studies have demonstrated that diffusion models can effectively learn the image distribution and generate new samples. Remarkably, these models can achieve this even with a small number of training samples despite a large image dimension, circumventing the curse of dimensionality. In this work, we provide theoretical insights into this phenomenon by leveraging key empirical observations: (i) the low intrinsic dimensionality of image data, (ii) a union of manifold structure of image data, and (iii) the low-rank property of the denoising autoencoder in trained diffusion models. These observations motivate us to assume the underlying data distribution of image data as a mixture of low-rank Gaussians and to parameterize the denoising autoencoder as a low-rank model according to the score function of the assumed distribution. With these setups, we rigorously show that optimizing the training loss of diffusion models is equivalent to solving the canonical subspace clustering problem over the training samples. Based on this equivalence, we further show that the minimal number of samples required to learn the underlying distribution scales linearly with the intrinsic dimensions under the above data and model assumptions. This insight sheds light on why diffusion models can break the curse of dimensionality and exhibit the phase transition in learning distributions. Moreover, we empirically establish a correspondence between the subspaces and the semantic representations of image data, facilitating image editing. We validate these results with corroborated experimental results on both simulated distributions and image datasets.

225FairGen: Controlling Fair Generations in Diffusion Models via Adaptive Latent Guidance

[openreview] [pdf]

Abstract Diffusion models have shown remarkable proficiency in generating photorealistic images, but their outputs often exhibit biases toward specific social groups, raising ethical concerns and limiting their wider adoption. This paper tackles the challenge of mitigating generative bias in diffusion models while maintaining image quality. We propose FairGen, an adaptive latent guidance mechanism enhanced by an auxiliary memory module, which operates during inference to control the generation distribution at a desired level. The latent guidance module dynamically adjusts the direction in the latent space to influence specific attributes, while the memory module tracks prior generation statistics and steers the scalar direction to align with the target distribution. To evaluate FairGen comprehensively, we introduce a bias evaluation benchmark tailored for diffusion models, spanning diverse domains such as employment, education, finance, and healthcare, along with complex user-generated prompts. Extensive empirical evaluations demonstrate that FairGen outperforms existing bias mitigation approaches, achieving substantial bias reduction while preserving generation quality. Furthermore, FairGen offers precise and flexible control over various target distributions, enabling nuanced adjustments to the generative process.

226Improving Influence-based Instruction Tuning Data Selection for Balanced Learning of Diverse Capabilities

[openreview] [pdf]

Abstract Selecting appropriate training data is crucial for successful supervised instruction fine-tuning (SFT), which aims to (1) elicit strong capabilities from pretrained large language models (LLMs), and (2) achieve balanced performance across a diverse range of tasks. Algorithms based on influence estimation have shown promise in achieving (1) through estimating the contribution of each training example to the model’s prediction on a downstream task, but often struggle with (2). Through systematic experiments, we attribute their underperformance to an inherent bias---certain tasks intrinsically have greater influence than others. Directly comparing influence scores across different tasks would thus bias the selected data towards these tasks, hurting the LM’s performance not only on other capabilities, but also, surprisingly, on the tasks for which the selected data has high influence. We propose BIDS, a novel Data Selection algorithm that targets Influential data in a Balanced way, to address this issue. Aiming to address the biased influence, BIDS first normalizes influence scores of the training data with respect to each downstream task at an instance level. BIDS then applies an iterative optimization process to further balance the selection of influential training data. At each step, BIDS selects the training example that bears the highest influence on the capability most underrepresented by the currently selected data. Experimental results demonstrate that BIDS consistently outperforms state-of-the-art influence-based data selection algorithms under various budgets. Remarkably, training on a 15% subset selected by BIDS can even outperform full-dataset training with a much more balanced distribution of downstream performance. Our analysis further highlights the importance of both instance-level normalization and iterative optimization of selected data for balanced learning of diverse capabilities.
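
A hedged sketch of the two-step BIDS recipe as described: z-score normalization of influence per task, then greedy selection for whichever capability is currently most underrepresented. The score matrix, budget, and coverage update rule are toy stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_tasks, budget = 100, 5, 20
influence = rng.standard_normal((n_examples, n_tasks))  # example x task scores

# Instance-level normalization per task (z-scores within each task column).
z = (influence - influence.mean(axis=0)) / influence.std(axis=0)

selected, covered = [], np.zeros(n_tasks)
available = set(range(n_examples))
for _ in range(budget):
    t = covered.argmin()                        # most underrepresented capability
    best = max(available, key=lambda i: z[i, t])
    selected.append(best)
    available.remove(best)
    covered[t] += max(z[best, t], 0.0)          # update coverage for that task
print(sorted(selected))
```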

227Towards Adaptive Time Series Foundation Models Against Distribution Shift

[openreview] [pdf]

Abstract Foundation models have demonstrated remarkable success across diverse machine-learning domains through large-scale pretraining. However, their application to time series data poses challenges due to substantial mismatches in the distributions of pretraining datasets. In this paper, we tackle this issue by proposing a domain-aware adaptive normalization strategy within the Transformer architecture. Specifically, we replace the traditional LayerNorm with a prototype-guided dynamic normalization mechanism, where learned prototypes represent distinct data distributions, and sample-to-prototype similarity determines the appropriate normalization layer. This approach effectively captures the diverse characteristics of time series data, ensuring better alignment between pretrained representations and downstream tasks. Our method significantly improves fine-tuning performance, outperforming vanilla pretraining techniques and reducing the negative impact of distribution shifts. Extensive experiments on various real-world time series datasets demonstrate the efficacy of our approach, paving the way for more robust and generalizable time series foundation models.
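
A hedged sketch of prototype-guided dynamic normalization: a bank of LayerNorms, one per learned prototype, mixed per sample according to similarity between the sample representation and the prototypes. The soft routing rule and the mean-pooled sample summary are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeNorm(nn.Module):
    def __init__(self, dim: int, n_prototypes: int = 4):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(n_prototypes)])

    def forward(self, x):                        # x: (batch, seq, dim)
        summary = x.mean(dim=1)                  # per-sample representation
        sim = F.cosine_similarity(summary[:, None, :], self.prototypes[None], dim=-1)
        w = sim.softmax(dim=-1)                  # (batch, n_prototypes) routing weights
        outs = torch.stack([norm(x) for norm in self.norms], dim=1)
        return (w[:, :, None, None] * outs).sum(dim=1)

pn = PrototypeNorm(dim=16)
print(pn(torch.randn(2, 10, 16)).shape)          # (2, 10, 16)
```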

228Leveraging Diffusion Transformers for Stock Factor Augmentation in Financial Markets

[openreview] [pdf]

Abstract Data scarcity poses a significant challenge in training machine learning models for stock forecasting, often leading to low signal-to-noise ratio (SNR) and data homogeneity that degrade model performance. To address these issues, we introduce DiffsFormer, a novel approach utilizing artificial intelligence-generated samples (AIGS) with a Transformer-based Diffusion Model. Initially trained on a large-scale source domain with conditional guidance to capture global joint distribution, DiffsFormer augments training by editing existing samples for specific downstream tasks, allowing control over the deviation of generated data from the target domain. We evaluate DiffsFormer on the CSI300 and CSI800 datasets using eight commonly used machine learning models, achieving relative improvements of 7.3% and 22.1% in annualized return ratio, respectively. Extensive experiments provide insights into DiffsFormer’s functionality and its components, illustrating their role in mitigating data scarcity and enhancing model performance. Our findings demonstrate the potential of AIGS and DiffsFormer in addressing data limitations in stock forecasting, with the ability to generate realistic stock factors and control the editing process. These results validate our approach and contribute to a deeper understanding of its underlying mechanisms.

229Transformers Struggle to Learn to Search Without In-context Exploration

[openreview] [pdf]

Abstract Search is an ability fundamental to many important tasks, and recent studies have shown that large language models (LLMs) struggle to perform search robustly. It is unknown whether this inability is due to a lack of data, insufficient model parameters, or fundamental limitations of the transformer architecture. In this work, we use graph connectivity as a testbed to generate effectively limitless high-coverage data to train small transformers and test whether they can learn to perform search. We find that, under specific conditions on the training distribution, the transformer is able to learn to search. We analyze the algorithm that the transformer has learned through a novel mechanistic interpretability technique that enables us to extract the computation graph from the trained model. We find that for each vertex in the input graph, transformers compute the set of vertices reachable from that vertex. Each layer then progressively expands these sets, allowing the model to search over a number of vertices exponential in the number of layers. However, we find that as the input graph size increases, the transformer has greater difficulty in learning the task. This difficulty is not resolved even as the number of parameters is increased, suggesting that simply increasing the scale of LLMs will not lead to robust search abilities. Finally, we show that by loosening the task to allow the model to explore the graph in-context, allowing the model to visit vertices that do not necessarily lead to the goal and backtrack, the transformer is able to more easily learn to search robustly.

230Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

[openreview] [pdf]

Abstract Creating high-quality data for training robust language-instructed agents is a long-lasting challenge in embodied AI. In this paper, we introduce a Self-Refining Data Flywheel (SRDF) that generates high-quality and large-scale navigational instruction-trajectory pairs by iteratively refining the data pool through the collaboration between two models, the instruction generator and the navigator, without any human-in-the-loop annotation. Specifically, SRDF starts with using a base generator to create an initial data pool for training a base navigator, followed by applying the trained strong navigator to filter the data pool. This leads to higher-fidelity data to train a better generator, which can, in turn, produce higher-quality data for training the next-round navigator. Such a flywheel establishes a data self-refining process, yielding a continuously improved and highly effective dataset for large-scale language-guided navigation learning. Our experiments demonstrate that after several flywheel rounds, the navigator elevates the performance boundary from 70% to 78% SPL on the classic R2R test set, surpassing human performance (76%) for the first time. Meanwhile, this process results in a superior instruction generator, as reflected by the improved SPICE from 23.5 to 25.7, better than all published approaches tailored for VLN instruction generation. Finally, we demonstrate the scalability of our method through increasing environment and instruction diversity, and the generalization ability of our pre-trained navigator across various downstream navigation tasks, surpassing state-of-the-art performance by a large margin in all cases. Code is uploaded as supplementary materials and all our data/code/models will also be publicly released.

231DOME: Taming Diffusion Model into High-Fidelity Controllable Occupancy World Model

[openreview] [pdf]

Abstract We propose DOME, a diffusion-based world model that predicts future occupancy frames based on past occupancy observations. The ability of this world model to capture the evolution of the environment is crucial for planning in autonomous driving. Compared to 2D video-based world models, the occupancy world model utilizes a native 3D representation, which features easily obtainable annotations and is modality-agnostic. This flexibility has the potential to facilitate the development of more advanced world models. Existing occupancy world models either suffer from detail loss due to discrete tokenization or rely on simplistic diffusion architectures, leading to inefficiencies and difficulties in predicting future occupancy with controllability. Our DOME exhibits two key features: (1) High-Fidelity and Long-Duration Generation. We adopt a spatial-temporal diffusion transformer to predict future occupancy frames based on historical context. This architecture efficiently captures spatial-temporal information, enabling high-fidelity details and the ability to generate predictions over long durations. (2) Fine-grained Controllability. We address the challenge of controllability in predictions by introducing a trajectory resampling method, which significantly enhances the model’s ability to generate controlled predictions. Extensive experiments on the widely used nuScenes dataset demonstrate that our method surpasses existing baselines in both qualitative and quantitative evaluations, establishing a new state-of-the-art performance on nuScenes. Specifically, our approach surpasses the baseline by 10.5% in mIoU and 21.2% in IoU for occupancy reconstruction, and by 36.0% in mIoU and 24.6% in IoU for 4D occupancy forecasting.

232Improving Probabilistic Diffusion Models With Optimal Covariance Matching

[openreview] [pdf]

Abstract The probabilistic diffusion model has become highly effective across various domains. Typically, sampling from a diffusion model involves using a denoising distribution characterized by a Gaussian with a learned mean and either fixed or learned covariances. In this paper, we leverage the recently proposed covariance moment matching technique and introduce a novel method for learning the diagonal covariances. Unlike traditional data-driven covariance approximation approaches, our method involves directly regressing the optimal analytic covariance using a new, unbiased objective named Optimal Covariance Matching (OCM). This approach can significantly reduce the approximation error in covariance prediction. We demonstrate how our method can substantially enhance the sampling efficiency, recall rate and likelihood of both diffusion models and latent diffusion models.

233Learning Actionable Counterfactual Explanations in Large State Spaces

[openreview] [pdf]

Abstract An increasing number of high-stakes domains rely on machine learning to make decisions that have significant consequences for individuals, such as in loan approvals and college admissions. The black-box nature of these processes has led to a growing demand for solutions that make individuals aware of potential ways they could improve their qualifications. Counterfactual explanations (CFEs) are one form of feedback commonly used to provide insight into decision-making systems. Specifically, contemporary CFE generators provide explanations in the form of low-level CFEs whose constituent actions precisely describe how much a negatively classified individual should add or subtract from their input features to achieve the desired positive classification. However, low-level CFE generators have several shortcomings: they are hard to scale, often misaligned with real-world conditions, constrained by information access (e.g., they cannot query the classifier), and make inadequate use of available historical data. To address these challenges, we propose three data-driven CFE generators that create generalizable CFEs with desirable characteristics for individuals and decision-makers. Through extensive empirical experiments, we compare the proposed CFE generators with a low-level CFE generator on four real-world datasets (BRFSS, Foods, and two NHANES datasets), five semi-synthetic datasets, and five variants of fully-synthetic datasets. Our problem can also be seen as learning an optimal policy in a family of large but deterministic Markov decision processes.

234DDRL: A Diffusion-Driven Reinforcement Learning Approach for Enhanced TSP Solutions

[openreview] [pdf]

Abstract The Traveling Salesman Problem (TSP) is a fundamental challenge in combinatorial optimization, known for its NP-hard complexity. Reinforcement Learning (RL) has proven effective in managing larger and more complex TSP instances, yet it encounters challenges such as training instability and the need for substantial training resources. Diffusion models, known for iteratively refining noisy inputs to generate high-quality solutions, offer scalability and exploration capabilities for TSP but may struggle with optimality in complex cases and require large, resource-intensive training datasets. To address these limitations, we propose DDRL (Diffusion-Driven Reinforcement Learning), which integrates diffusion models with RL. DDRL employs a latent vector to generate an adjacency matrix, merging image and graph learning within a unified RL framework. By utilizing a pre-trained diffusion model as a prior, DDRL exhibits strong scalability and enhanced convergence stability. We also provide theoretical analysis showing that training DDRL aligns with the diffusion policy gradient in the process of solving the TSP, demonstrating its effectiveness. Additionally, we introduce novel constraint datasets—obstacle, path, and cluster constraints—to evaluate DDRL’s generalization capabilities. We demonstrate that DDRL offers a robust solution that outperforms existing methods on both basic and constrained TSP problems.

235Classroom-Inspired Multi-Mentor Distillation with Adaptive Learning Strategies

[openreview] [pdf]

Abstract We propose ClassroomKD, a novel multi-mentor knowledge distillation framework inspired by classroom environments to enhance knowledge transfer between a student and multiple mentors. Unlike traditional methods that rely on fixed mentor-student relationships, our framework dynamically selects and adapts the teaching strategies of diverse mentors based on their effectiveness for each data sample. ClassroomKD comprises two main modules: the Knowledge Filtering (KF) Module and the Mentoring Module. The KF Module dynamically ranks mentors based on their performance for each input, activating only high-quality mentors to minimize error accumulation and prevent information loss. The Mentoring Module adjusts the distillation strategy by tuning each mentor’s influence according to the performance gap between the student and mentors, effectively modulating the learning pace. Extensive experiments on image classification (CIFAR-100 and ImageNet) and 2D human pose estimation (COCO Keypoints and MPII Human Pose) demonstrate that ClassroomKD outperforms existing knowledge distillation methods for different network architectures. Our results highlight that a dynamic and adaptive approach to mentor selection and guidance leads to more effective knowledge transfer, paving the way for enhanced model performance through distillation.
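
A hedged sketch of the two modules as described: per sample, mentors are kept only if they outperform the student (Knowledge Filtering), and each kept mentor's KD term is weighted by the student-mentor performance gap (Mentoring). The confidence-based ranking, gap weighting, and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def classroom_kd_loss(student_logits, mentor_logits_list, labels, T=4.0):
    ce = F.cross_entropy(student_logits, labels)
    s_conf = student_logits.softmax(-1).gather(1, labels[:, None]).squeeze(1)
    kd_terms = []
    for m_logits in mentor_logits_list:
        m_conf = m_logits.softmax(-1).gather(1, labels[:, None]).squeeze(1)
        keep = (m_conf > s_conf).float()              # filter weak mentors per sample
        gap = (m_conf - s_conf).clamp(min=0.0)        # pace: larger gap, more weight
        kl = F.kl_div(F.log_softmax(student_logits / T, -1),
                      F.softmax(m_logits / T, -1),
                      reduction="none").sum(-1) * T * T
        kd_terms.append((keep * gap * kl).mean())
    return ce + torch.stack(kd_terms).sum()

B, C = 8, 10
loss = classroom_kd_loss(torch.randn(B, C),
                         [torch.randn(B, C) for _ in range(3)],
                         torch.randint(0, C, (B,)))
print(loss.item())
```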

236Prompt Optimization with Logged Bandit Data

[openreview] [pdf]

Abstract We study how to use naturally available user feedback, such as clicks, to optimize large language model (LLM) pipelines for generating personalized sentences using prompts. Naive approaches, which estimate the policy gradient in the prompt space, suffer either from variance caused by the large action space of prompts or bias caused by inaccurate reward predictions. To circumvent these challenges, we propose Direct Sentence Off-policy gradient (DSO), which estimates the policy gradient by leveraging similarity among generated sentences, substantially reducing variance while suppressing the bias. Empirical results on our newly established suite of benchmarks, called OfflinePrompts, demonstrate the effectiveness of the proposed approach in generating personalized descriptions for movie recommendations, particularly when the number of candidate prompts is large.

237Exploring the Design Space of Diffusion Bridge Models via Stochasticity Control

[openreview] [pdf]

Abstract Diffusion bridge models effectively facilitate image-to-image (I2I) translation by connecting two distributions. However, existing methods overlook the impact of noise in the sampling SDEs, the transition kernel, and the base distribution on sampling efficiency, image quality and diversity. To address this gap, we propose the Stochasticity-controlled Diffusion Bridge (SDB), a novel theoretical framework that extends the design space of diffusion bridges and provides strategies to mitigate singularities during both training and sampling. By controlling stochasticity in the sampling SDEs, our sampler achieves speeds up to $5\times$ faster than the baseline, while also producing lower FID scores. After training, SDB sets new benchmarks in image quality and sampling efficiency by managing stochasticity within the transition kernel. Furthermore, introducing stochasticity into the base distribution significantly improves image diversity, as quantified by a newly introduced metric.

238Consistency Model is an Effective Posterior Sample Approximation for Diffusion Inverse Solvers

[openreview] [pdf]

Abstract Diffusion Inverse Solvers (DIS) are designed to sample from the conditional distribution $p_{\theta}(X_0|y)$, with a pre-trained diffusion model $p_{\theta}(X_0)$, an operator $f(\cdot)$, and a measurement $y=f(x_0')$ derived from an unknown image $x_0'$. Existing DIS estimate the conditional score function by evaluating $f(\cdot)$ with an approximated posterior sample drawn from $p_{\theta}(X_0|X_t)$. However, most prior approximations rely on posterior means, which may not lie in the support of the image distribution and thus diverge from the appearance of genuine images. Such out-of-support samples may significantly degrade the performance of the operator $f(\cdot)$, particularly when it is a neural network. In this paper, we introduce a novel approach for posterior approximation that guarantees to generate valid samples within the support of the image distribution, and also enhances compatibility with neural network-based operators $f(\cdot)$. We first demonstrate that the solution of the Probability Flow Ordinary Differential Equation (PF-ODE) with an initial value $x_t$ yields an effective posterior sample from $p_{\theta}(X_0|X_t=x_t)$ with high probability. Based on this observation, we adopt the Consistency Model (CM), which is distilled from the PF-ODE, for posterior sampling. Through extensive experiments, we show that our proposed method for posterior sample approximation substantially enhances the effectiveness of DIS for neural network operators $f(\cdot)$ (e.g., in semantic segmentation). The source code is provided in the supplementary material.

239Distilling the Knowledge in Data Pruning

[openreview] [pdf]

Abstract With the increasing size of datasets used for training neural networks, data pruning has gained traction in recent years. However, most current data pruning algorithms are limited in their ability to preserve accuracy compared to models trained on the full data, especially in high pruning regimes. In this paper we explore the application of data pruning while incorporating knowledge distillation (KD) when training on a pruned subset. That is, rather than relying solely on ground-truth labels, we also use the soft predictions from a teacher network pre-trained on the complete data. By integrating KD into training, we demonstrate significant improvement across datasets, pruning methods, and on all pruning fractions. We first establish a theoretical motivation for employing self-distillation to improve training on pruned data. Then, we empirically make a compelling and highly practical observation: using KD, simple random pruning is comparable or superior to sophisticated pruning methods across all pruning regimes. On ImageNet for example, we achieve superior accuracy despite training on a random subset of only 50% of the data. Additionally, we demonstrate a crucial connection between the pruning factor and the optimal knowledge distillation weight. This helps mitigate the impact of samples with noisy labels and low-quality images retained by typical pruning algorithms. Finally, we make an intriguing observation: when using lower pruning fractions, larger teachers lead to accuracy degradation, while surprisingly, employing teachers with a smaller capacity than the student’s may improve results. Our code will be made available.
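
A minimal sketch of the training objective studied here: cross-entropy on the pruned subset blended with distillation from a teacher pretrained on the full data. The paper connects the optimal KD weight to the pruning factor; the linear schedule below is our illustrative assumption of that connection, not the paper's exact rule.

```python
import torch
import torch.nn.functional as F

def pruned_kd_loss(student_logits, teacher_logits, labels, keep_fraction, T=2.0):
    alpha = 1.0 - keep_fraction          # assumed schedule: prune more, distill more
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    return (1 - alpha) * ce + alpha * kd

B, C = 16, 100
loss = pruned_kd_loss(torch.randn(B, C), torch.randn(B, C),
                      torch.randint(0, C, (B,)), keep_fraction=0.5)
print(loss.item())
```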

240Continuous Ensemble Weather Forecasting with Diffusion models

[openreview] [pdf]

Abstract Weather forecasting has seen a shift in methods from numerical simulations to data-driven systems. While initial research in the area focused on deterministic forecasting, recent works have used diffusion models to produce skillful ensemble forecasts. These models are trained on a single forecasting step and rolled out autoregressively. However, they are computationally expensive and accumulate errors for high temporal resolution due to the many rollout steps. We address these limitations with Continuous Ensemble Forecasting, a novel and flexible method for sampling ensemble forecasts in diffusion models. The method can generate temporally consistent ensemble trajectories completely in parallel, with no autoregressive steps. Continuous Ensemble Forecasting can also be combined with autoregressive rollouts to yield forecasts at an arbitrary fine temporal resolution without sacrificing accuracy. We demonstrate that the method achieves competitive results for global weather forecasting with good probabilistic properties.

241Mitigating Goal Misgeneralization via Minimax Regret

[openreview] [pdf]

Abstract Robustness research in reinforcement learning often focuses on ensuring that the policy consistently exhibits capable, goal-driven behavior. However, not every capable behavior is the intended behavior.Goal misgeneralizationcan occur when the policy generalizes capably with respect to a ‘proxy goal’ whose optimal behavior correlates with the intended goal on the training distribution, but not out of distribution. Though the intended goal would be ambiguous if they were perfectly correlated in training, we show progress can be made if the goals are onlynearly ambiguous, with the training distribution containing a small proportion ofdisambiguatinglevels. We observe that the training signal from disambiguating levels could be amplified by regret-based prioritization. We formally show that approximately optimal policies on maximal-regret levels avoid the harmful effects of goal misgeneralization, which may exist without this prioritization. Empirically, we find that current regret-based Unsupervised Environment Design (UED) methods can mitigate the effects of goal misgeneralization, though do not always entirely eliminate it. Our theoretical and empirical results show that as UED methods improve they could further mitigate goal misgeneralization in practice.

242A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training

[openreview] [pdf]

Abstract Training diffusion models is always a computation-intensive task. In this paper, we introduce a novel speed-up method for diffusion model training, called SpeeD, which is based on a closer look at time steps. Our key findings are: i) Time steps can be empirically divided into acceleration, deceleration, and convergence areas based on the process increment. ii) These time steps are imbalanced, with many concentrated in the convergence area. iii) The concentrated steps provide limited benefits for diffusion training. To address this, we design an asymmetric sampling strategy that reduces the frequency of steps from the convergence area while increasing the sampling probability for steps from other areas. Additionally, we propose a weighting strategy to emphasize the importance of time steps with rapid-change process increments. As a plug-and-play and architecture-agnostic approach, SpeeD consistently achieves 3-times acceleration across various diffusion architectures, datasets, and tasks. Notably, due to its simple design, our approach significantly reduces the cost of diffusion model training with minimal overhead. Our research enables more researchers to train diffusion models at a lower cost.
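
A hedged sketch of the asymmetric sampling and weighting described above: steps in the (large) convergence area are sampled less often, and a per-step loss weight emphasizes rapid-change increments. The area boundary and weight values are illustrative assumptions.

```python
import torch

T = 1000
convergence_start = 600                  # assumed boundary of the convergence area

probs = torch.ones(T)
probs[convergence_start:] = 0.2          # suppress concentrated, low-value steps
probs /= probs.sum()

def sample_timesteps(batch_size):
    # Asymmetric sampling: draw training time steps from the skewed distribution.
    return torch.multinomial(probs, batch_size, replacement=True)

def loss_weight(t):
    # Emphasize steps with rapid-change process increments (toy weight values).
    return torch.where(t < convergence_start, torch.tensor(1.5), torch.tensor(1.0))

t = sample_timesteps(8)
print(t, loss_weight(t))
```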

243DiffMove: Human Trajectory Recovery via Conditional Diffusion Model

[openreview] [pdf]

Abstract Recovering human trajectories from incomplete or missing data is crucial for many mobility-based urban applications, e.g., urban planning, transportation, and location-based services. Existing methods mainly rely on recurrent neural networks or attention mechanisms. Though promising, they encounter limitations in capturing complex spatial-temporal dependencies in low-sampling trajectories. Recently, diffusion models have shown potential in content generation. However, most of the proposed methods generate content in continuous numerical representations, which cannot be directly adapted to human location trajectory recovery. In this paper, we introduce a conditional diffusion-based trajectory recovery method, namely DiffMove. It first transforms locations in trajectories into the embedding space, in which the embedding denoising is performed, and then missing locations are recovered by an embedding decoder. DiffMove not only improves accuracy by introducing high-quality generative methods into trajectory recovery, but also carefully models the transition, periodicity, and temporal patterns in human mobility. Extensive experiments based on two representative real-world mobility datasets are conducted, and the results show significant improvements (an average of 11% in recall) over the baselines.

244Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective

[openreview] [pdf]

Abstract Direct Preference Optimization (DPO) has gained attention as an efficient alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human preferences. Despite its advantages, DPO suffers from a length bias, generating responses longer than those from the reference model. Existing solutions like SimPO and SamPO address this issue but uniformly treat the contribution of rewards across sequences, overlooking temporal dynamics. To this end, we propose an enhanced preference optimization method that incorporates a temporal decay factor controlled by a gamma parameter. This dynamic weighting mechanism adjusts the influence of each reward based on its position in the sequence, prioritizing earlier tokens that are more critical for alignment. By adaptively focusing on more relevant feedback, our approach mitigates overfitting to less pertinent data and remains responsive to evolving human preferences. Experimental results on several benchmarks show that our approach consistently outperforms vanilla DPO by 5.9-8.8 points on AlpacaEval 2 and 3.3-9.7 points on Arena-Hard across different model architectures and sizes.
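
A small sketch of the temporal-decay weighting, under the assumption that per-token implicit rewards are scaled by gamma raised to the token position before being summed into a DPO-style loss; the paper's exact parameterization may differ.

```python
import torch
import torch.nn.functional as F

def decayed_reward(logp_policy, logp_ref, gamma=0.95):
    # logp_*: (seq_len,) per-token log-probs of one response
    rewards = logp_policy - logp_ref                      # implicit per-token rewards
    decay = gamma ** torch.arange(rewards.numel(), dtype=rewards.dtype)
    return (decay * rewards).sum()                        # earlier tokens weigh more

def decayed_dpo_loss(lp_chosen, lr_chosen, lp_rejected, lr_rejected, beta=0.1):
    margin = decayed_reward(lp_chosen, lr_chosen) - decayed_reward(lp_rejected, lr_rejected)
    return -F.logsigmoid(beta * margin)
```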

245Tighter Performance Theory of FedExProx

[openreview] [pdf]

Abstract We revisit FedExProx -- a recently proposed distributed optimization method designed to enhance convergence properties of parallel proximal algorithms via extrapolation. In the process, we uncover a surprising flaw: its known theoretical guarantees on quadratic optimization tasks are no better than those offered by the vanilla Gradient Descent (GD) method. Motivated by this observation, we develop a novel analysis framework, establishing a tighter linear convergence rate for non-strongly convex quadratic problems. By incorporating both computation and communication costs, we demonstrate that FedExProx can indeed provably outperform GD, in stark contrast to the original analysis. Furthermore, we consider partial participation scenarios and analyze two adaptive extrapolation strategies --- based on gradient diversity and Polyak stepsizes --- again significantly outperforming previous results. Moving beyond quadratics, we extend the applicability of our analysis to general functions satisfying the Polyak-Łojasiewicz condition, outperforming the previous strongly convex analysis while operating under weaker assumptions. Backed by empirical results, our findings point to a new and stronger potential of FedExProx, paving the way for further exploration of the benefits of extrapolation in federated learning.
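
A self-contained toy of the extrapolated parallel proximal update on quadratic workers, in the spirit of the setting above; the random quadratics, the step size, and the fixed extrapolation parameter `alpha` are illustrative assumptions (the paper also analyzes adaptive rules based on gradient diversity and Polyak stepsizes).

```python
import numpy as np

def prox(x, A, b, gamma):
    # argmin_z 0.5*z^T A z - b^T z + ||z - x||^2 / (2*gamma)
    return np.linalg.solve(A + np.eye(len(x)) / gamma, b + x / gamma)

rng = np.random.default_rng(0)
workers = [(np.diag(rng.uniform(0.5, 2.0, 3)), rng.normal(size=3)) for _ in range(5)]

x, gamma, alpha = np.zeros(3), 1.0, 1.5        # alpha > 1 extrapolates past the average
for _ in range(100):
    avg = np.mean([prox(x, A, b, gamma) for A, b in workers], axis=0)
    x = x + alpha * (avg - x)                  # FedExProx-style extrapolation step
print(x)  # converges to the fixed point of the averaged proximal operator
```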

246Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms

[openreview] [pdf]

Abstract We study the challenging exploration incentive problem in both bandit and reinforcement learning, where the rewards are scale-free and potentially unbounded, a setting driven by real-world scenarios and differing from existing work. Past works in reinforcement learning either assume costly interactions with an environment or propose algorithms that find potentially low-quality local maxima. Motivated by EXP-type methods that integrate multiple agents (experts) for exploration in bandits under the assumption that rewards are bounded, we propose new algorithms, namely EXP4.P and EXP4-RL, for exploration in the unbounded reward case, and demonstrate their effectiveness in these new settings. Unbounded rewards introduce challenges as the regret cannot be limited by the number of trials, and selecting suboptimal arms may lead to infinite regret. Specifically, we establish EXP4.P’s regret upper bounds in both bounded and unbounded linear and stochastic contextual bandits. Surprisingly, we also find that by including one sufficiently competent expert, EXP4.P can achieve global optimality in the linear case. This unbounded reward result is also applicable to a revised version of EXP3.P in the Multi-armed Bandit scenario. In EXP4-RL, we extend EXP4.P from bandit scenarios to reinforcement learning to incentivize exploration by multiple agents, including one high-performing agent, for both efficiency and excellence. This algorithm has been tested on difficult-to-explore games and shows significant improvements in exploration compared to the state-of-the-art.

247d-Linear Generation Error Bound for Distributed Diffusion Models

[openreview] [pdf]

Abstract The recent rise of distributed diffusion models has been driven by the explosive growth of data and the increasing demand for data generation. However, distributed diffusion models face unique challenges in resource-constrained environments. Existing approaches lack theoretical support, particularly with respect to generation error in such settings. In this paper, we are the first to derive the generation error bound for distributed diffusion models with arbitrary pruning, not assuming perfect score approximation. By analyzing the convergence of the score estimation model trained with arbitrary pruning in a distributed manner, we highlight the impact of complex factors such as model evolution dynamics and arbitrary pruning on the generation performance. This theoretical generation error bound is linear in the data dimension $d$, aligning with state-of-the-art results in the single-worker paradigm.

248Dual Caption Preference Optimization for Diffusion Models

[openreview] [pdf]

Abstract Recent advancements in human preference optimization, originally developed for Large Language Models (LLMs), have shown significant potential in improving text-to-image diffusion models. These methods aim to learn the distribution of preferred samples while distinguishing them from less preferred ones. However, existing preference datasets often exhibit overlap between these distributions, leading to a conflict distribution. Additionally, we identified a performance issue in previous optimization methods, where using the same prompt for preferred and less preferred images, known as the irrelevant prompt issue, restricts model performance. To address these challenges, we propose Dual Caption Preference Optimization (DCPO), a novel approach that utilizes two distinct captions to mitigate irrelevant prompts. To tackle the conflict distribution, we introduce the Pick-Double Caption dataset, a modified version of Pick-a-Pic v2 with separate captions for preferred and less preferred images. We further propose three different strategies for generating distinct captions: captioning, perturbation, and hybrid methods. Our experiments show that DCPO significantly improves image quality and relevance to prompts, outperforming Stable Diffusion (SD) 2.1, SFT-Chosen, Diffusion-DPO, and MaPO, all fine-tuned on SD 2.1 as the backbone, across multiple metrics including Pickscore, HPSv2.1, GenEval, CLIPscore, and ImageReward.

249Enhancing Adversarial Transferability Through Exploiting Multiple Randomized Trajectories for Better Global Guidance

[openreview] [pdf]

Abstract Deep neural networks are well-known for their vulnerability to adversarial examples, particularly demonstrating poor performance in white-box attack settings. However, most white-box attack methods heavily depend on the target model and often get trapped in local optima, leading to limited adversarial transferability. Techniques such as momentum, variance reduction, and gradient penalty mitigate overfitting by combining historical information with local regions around adversarial examples, but exploration of the global loss landscape remains constrained, hindering further performance improvements. In this work, we find that initialization influences the optimization of adversarial examples, often guiding them toward multiple local optima, providing an opportunity to explore the loss landscape more effectively. Based on this insight, we propose two strategies: randomized global initialization and dual examples. These strategies utilize multiple trajectories from benign samples to capture global optimization directions, enhancing adversarial transferability. Our approach integrates seamlessly with existing adversarial attack methods and significantly improves transferability, as demonstrated by empirical evaluations on the standard ImageNet dataset.

250Looking Backward: Retrospective Backward Synthesis for Goal-Conditioned GFlowNets

[openreview] [pdf]

Abstract Generative Flow Networks (GFlowNets), a new family of probabilistic samplers, have demonstrated remarkable capabilities to generate diverse sets of high-reward candidates, in contrast to standard return maximization approaches (e.g., reinforcement learning) which often converge to a single optimal solution. Recent works have focused on developing goal-conditioned GFlowNets, which aim to train a single GFlowNet capable of achieving different outcomes as the task specifies. However, training such models is challenging due to extremely sparse rewards, particularly in high-dimensional problems. Moreover, previous methods suffer from the limited coverage of explored trajectories during training, which presents more pronounced challenges when only offline data is available. In this work, we propose a novel method called Retrospective Backward Synthesis (RBS) to address these critical problems. Specifically, RBS synthesizes new backward trajectories in goal-conditioned GFlowNets to enrich training trajectories with enhanced quality and diversity, thereby introducing copious learnable signals for effectively tackling the sparse reward problem. Extensive empirical results show that our method improves sample efficiency by a large margin and outperforms strong baselines on various standard evaluation benchmarks.

251α-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs

[openreview] [pdf]

Abstract Aligning large language models (LLMs) with human values and intentions is crucial for their utility, honesty, and safety. Reinforcement learning from human feedback (RLHF) is a popular approach to achieve this alignment, but it faces challenges in computational efficiency and training stability. Recent methods like Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO) have proposed offline alternatives to RLHF, simplifying the process by reparameterizing the reward function. However, DPO depends on a potentially suboptimal reference model, and SimPO’s assumption of a fixed target reward margin may lead to suboptimal decisions in diverse data settings. In this work, we propose α-DPO, an adaptive preference optimization algorithm designed to address these limitations by introducing a dynamic reward margin. Specifically, α-DPO employs an adaptive preference distribution, balancing the policy model and the reference model to achieve personalized reward margins. We provide theoretical guarantees for α-DPO, demonstrating its effectiveness as a surrogate optimization objective and its ability to balance alignment and diversity through KL divergence control. Empirical evaluations on AlpacaEval 2 and Arena-Hard show that α-DPO consistently outperforms DPO and SimPO across various model settings, establishing it as a robust approach for fine-tuning LLMs. Our method achieves significant improvements in win rates, highlighting its potential as a powerful tool for LLM alignment.
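
A hedged sketch of a preference loss with an instance-dependent margin in the spirit of the abstract; the concrete margin in α-DPO comes from its adaptive preference distribution, so the reference-gap margin used here is only an illustrative stand-in.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
                         beta=0.1, alpha=0.5):
    # pi_* / ref_*: summed log-probs under the policy / reference model
    preference = pi_chosen - pi_rejected           # policy's preference strength
    margin = alpha * (ref_chosen - ref_rejected)   # illustrative adaptive margin
    return -F.logsigmoid(beta * (preference - margin))
```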

252Soup to go: mitigating forgetting during continual learning with model averaging

[openreview] [pdf]

Abstract In continual learning with pretrained large language models (LLMs), where data from instruction fine-tuning (IFT) tasks arrives in a sequence, fine-tuning on later tasks will often lead to performance degradation on earlier tasks. This is especially pronounced when the IFT tasks come from diverse domains. In this setting, how can we mitigate catastrophic forgetting of earlier tasks and retain what the LLM has learned? Inspired by a classical continual learning method, the L2 penalty on previous weights, we propose Sequential Fine-tuning with Averaging (SFA), a method that merges the model with earlier checkpoints trained on previous tasks during the course of training. SOTA approaches typically maintain a data buffer of past tasks or impose a penalty at each gradient step. However, our method achieves comparable results without the need to store past data or keep multiple copies of parameters for each gradient step. Furthermore, our method outperforms penalty methods like L2 regression and EWC, as well as other common merging techniques such as Task Arithmetic and TIES Merging. Finally, we show that using our method, a single model can simultaneously perform well on a range of fine-tuning tasks in diverse domains, including Math, Law and Code.
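
A minimal sketch of the merging step, assuming SFA periodically interpolates the current weights with a checkpoint from earlier in training; the merge coefficient and frequency are illustrative choices.

```python
import torch

@torch.no_grad()
def average_with_checkpoint(model, checkpoint_state, lam=0.5):
    # Interpolate in parameter space between the current weights and an
    # earlier checkpoint; lam = 0.5 is a plain average.
    for name, param in model.named_parameters():
        param.mul_(1.0 - lam).add_(checkpoint_state[name], alpha=lam)
```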

253Adapting Prediction Sets to Distribution Shifts Without Labels

[openreview] [pdf]

Abstract Recently there has been a surge of interest in deploying confidence set predictions rather than point predictions. Unfortunately, the effectiveness of such prediction sets is frequently impaired by distribution shifts in practice, and the challenge is often compounded by the lack of ground truth labels at test time. In this paper, we present a method for improving the quality of output prediction sets using only unlabeled data from the test domain. This is achieved by two new methods called ECP and EACP, which sit on top of existing set-valued classification methods and adjust their intervals according to the base model’s own uncertainty evaluation on the unlabeled test data. Through extensive experiments on a number of large-scale datasets and neural network architectures, we show that our methods provide consistent improvement over existing conformal prediction based baselines and nearly match the performance of fully supervised methods.
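
A rough sketch of the general idea of widening conformal sets using only the base model's uncertainty on unlabeled test inputs; the entropy-gap correction below is an assumption for illustration, and the actual ECP/EACP adjustments differ in form.

```python
import numpy as np

def avg_entropy(probs):
    return -np.mean(np.sum(probs * np.log(probs + 1e-12), axis=1))

def shift_adapted_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    # Standard split-conformal quantile from labeled calibration data...
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    q = np.quantile(scores, 1.0 - alpha)
    # ...then widen it by the model's confidence drop on the unlabeled test set.
    q += max(0.0, avg_entropy(test_probs) - avg_entropy(cal_probs))
    return [np.flatnonzero(1.0 - p <= q) for p in test_probs]
```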

254Offline Safe Policy Optimization From Human Feedback

[openreview] [pdf]

Abstract Offline preference-based reinforcement learning (PbRL) learns rewards and policies aligned with human preferences without the need for extensive reward engineering and direct interaction with human annotators. However, ensuring safety remains a critical challenge across many domains and tasks. Previous works on safe RL from human feedback (RLHF) first learn reward and cost models from offline data, and then use constrained RL to optimize a safe policy. However, inaccuracies in the reward and cost learning can impair performance when used with constrained RL methods. To address these challenges, (a) we introduce a framework that learns a policy based on pairwise preferences regarding the agent’s behavior in terms of rewards, as well as binary labels indicating the safety of trajectory segments, without access to ground-truth rewards or costs; (b) we combine the preference learning module with safety alignment in a constrained optimization problem. This optimization problem is solved using a Lagrangian method that directly learns a reward-maximizing safe policy without explicitly learning reward and cost models, avoiding the need for constrained RL; (c) to evaluate our approach, we construct new datasets with synthetic human feedback, built upon a well-established offline safe RL benchmark. Empirically, our method successfully learns safe policies with high rewards, outperforming baselines with ground-truth reward and cost, as well as state-of-the-art RLHF approaches.

255Goal Achievement Guided Exploration: Mitigating Premature Convergence in Reinforcement Learning

[openreview] [pdf]

Abstract Premature convergence to suboptimal policies remains a significant challenge in reinforcement learning (RL), particularly in tasks with sparse rewards or non-convex reward landscapes. Existing work usually utilizes reward shaping, such as curiosity-based internal rewards, to encourage exploring promising spaces. However, this may inadvertently introduce new local optima and impair the optimization for the actual target reward. To address this issue, we propose Goal Achievement Guided Exploration (GAGE), a novel approach that incorporates an agent’s goal achievement as a dynamic criterion for balancing exploration and exploitation. GAGE adaptively adjusts the exploitation level based on the agent’s current performance relative to an estimated optimal performance, thereby mitigating premature convergence. Extensive evaluations demonstrate that GAGE substantially improves learning outcomes across various challenging tasks by adapting convergence based on task success. Applicable to both continuous and discrete tasks, GAGE seamlessly integrates into existing RL frameworks, highlighting its potential as a versatile tool for enhancing exploration strategies in RL.

256Elucidating the Preconditioning in Consistency Distillation

[openreview] [pdf]

Abstract Consistency distillation is a prevalent way for accelerating diffusion models adopted in consistency (trajectory) models, in which a student model is trained to traverse backward on the probability flow (PF) ordinary differential equation (ODE) trajectory determined by the teacher model. Preconditioning is a vital technique for stabilizing consistency distillation, by linearly combining the input data and the network output with pre-defined coefficients as the consistency function. It imposes the boundary condition of consistency functions without restricting the form and expressiveness of the neural network. However, previous preconditionings are hand-crafted and may be suboptimal choices. In this work, we offer the first theoretical insights into the preconditioning in consistency distillation, by elucidating its design criteria and the connection to the teacher ODE trajectory. Based on these analyses, we further propose a principled way dubbed Analytic-Precond to analytically optimize the preconditioning according to the consistency gap (defined as the gap between the teacher denoiser and the optimal student denoiser) on a generalized teacher ODE. We demonstrate that Analytic-Precond can facilitate the learning of trajectory jumpers, enhance the alignment of the student trajectory with the teacher’s, and achieve 2× to 3× training acceleration of consistency trajectory models in multi-step generation across various datasets.
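
For context, the hand-crafted preconditioning used in consistency models writes the consistency function as f(x, t) = c_skip(t) * x + c_out(t) * F(x, t), with c_skip(eps) = 1 and c_out(eps) = 0 enforcing the boundary condition. The sketch below uses the EDM-style coefficients common in this literature; these are the kind of pre-defined coefficients that Analytic-Precond would replace with analytically optimized ones.

```python
import torch

def c_skip(t, eps=0.002, sigma_data=0.5):
    return sigma_data**2 / ((t - eps) ** 2 + sigma_data**2)

def c_out(t, eps=0.002, sigma_data=0.5):
    return sigma_data * (t - eps) / torch.sqrt(t**2 + sigma_data**2)

def consistency_fn(F_theta, x, t):
    # Boundary condition: at t = eps, c_skip = 1 and c_out = 0, so f(x, eps) = x,
    # without restricting the network F_theta itself.
    return c_skip(t) * x + c_out(t) * F_theta(x, t)
```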

257Unleashing the Potential of Diffusion Models for Incomplete Data Imputation

[openreview] [pdf]

Abstract Generative models play an important role in missing data imputation in that they aim to learn the joint distribution of full data. However, applying advanced deep generative models (such as Diffusion models) to missing data imputation is challenging due to 1) the inherent incompleteness of the training data and 2) the difficulty in performing conditional inference from unconditional generative models. To deal with these challenges, this paper introduces DiffPuter, a tailored diffusion model combined with the Expectation-Maximization (EM) algorithm for missing data imputation. DiffPuter iteratively trains a diffusion model to learn the joint distribution of missing and observed data and performs an accurate conditional sampling to update the missing values using a tailored reversed sampling strategy. Our theoretical analysis shows that DiffPuter’s training step corresponds to the maximum likelihood estimation of data density (M-step), and its sampling step represents the Expected A Posteriori estimation of missing values (E-step). Extensive experiments across ten diverse datasets and comparisons with 17 different imputation methods demonstrate DiffPuter’s superior performance. Notably, DiffPuter achieves an average improvement of 8.10% in MAE and 5.64% in RMSE compared to the most competitive existing method.
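
A toy rendering of the alternation described above, with an independent Gaussian standing in for the diffusion model so the sketch stays self-contained: the M-step fits a density model to the current completed data, and the E-step resamples the missing entries from it while observed entries stay fixed.

```python
import numpy as np

def em_impute(x, observed_mask, n_rounds=10, seed=0):
    # x: (n, d) array with arbitrary initial values at missing positions;
    # observed_mask: True where the entry is observed.
    rng = np.random.default_rng(seed)
    x = x.copy()
    for _ in range(n_rounds):
        mu, sigma = x.mean(axis=0), x.std(axis=0)    # "M-step": fit density model
        draws = rng.normal(mu, sigma, size=x.shape)  # "E-step": sample missing values
        x[~observed_mask] = draws[~observed_mask]    # observed entries stay fixed
    return x
```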

258Exploring New Frontiers in Vertical Federated Learning: the Role of Saddle Point Reformulation

[openreview] [pdf]

Abstract Distributed learning problems have gained significant popularity due to the increasing need for cluster training and the emergence of novel paradigms like Federated Learning (FL). One variant of FL, called Vertical Federated Learning (VFL), partitions data based on features across devices. The objective is to collectively train a model using the information available on each user’s device. This paper focuses on solving the VFL problem using the saddle point reformulation via the classical Lagrangian function. We first demonstrate how this formulation can be solved using deterministic methods. More importantly, the paper explores various stochastic modifications to adapt to practical scenarios, such as employing compression techniques for efficient information transmission, enabling partial participation for asynchronous communication, and utilizing coordinate selection for faster local computation. We show that the saddle point reformulation plays a key role, opening up possibilities to use the aforementioned extensions that seem impossible in the standard minimization formulation. Convergence estimates are provided for each algorithm, demonstrating their effectiveness in addressing the VFL problem. Additionally, alternative reformulations of the VFL problem are investigated, and numerical experiments are conducted to validate the proposed methods’ performance and effectiveness.

259Interactive Adjustment for Human Trajectory Prediction with Individual Feedback

[openreview] [pdf]

Abstract Human trajectory prediction is fundamental for autonomous driving and service robots. The research community has studied various important aspects of this task and made remarkable progress recently. However, there is an essential perspective which has not been well exploited in previous research, namely individual feedback. Individual feedback exists in the sequential nature of trajectory prediction, where earlier predictions of a target can be verified over time against its ground-truth trajectories to obtain feedback, which provides valuable experience for subsequent predictions on the same agent. In this paper, we show such feedback can reveal the strengths and weaknesses of the model’s predictions on a specific target and heuristically guide the model to deliver better predictions on that target. We present an interactive adjustment network to effectively model and leverage the feedback. This network first exploits the feedback from previous predictions to dynamically generate an adjuster, which then interactively makes appropriate adjustments to current predictions for more accurate ones. We propose a novel displacement expectation loss to train this interactive architecture. Through experiments on representative prediction methods and widely-used benchmarks, we demonstrate the great value of individual feedback and the superior effectiveness of the proposed interactive adjustment network. Our code will be made publicly available.

260Best-of-Both-Worlds Policy Optimization for CMDPs with Bandit Feedback

[openreview] [pdf]

Abstract We study online learning in constrained Markov decision processes (CMDPs) in which rewards and constraints may be either stochastic or adversarial. In such settings, Stradi et al. (2024b) proposed the first best-of-both-worlds algorithm able to seamlessly handle stochastic and adversarial constraints, achieving optimal regret and constraint violation bounds in both cases. This algorithm suffers from two major drawbacks. First, it only works under full feedback, which severely limits its applicability in practice. Moreover, it relies on optimizing over the space of occupancy measures, which requires solving convex optimization problems, a highly inefficient task. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback. Specifically, when the constraints are stochastic, the algorithm achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation, while, when they are adversarial, it attains $\widetilde{\mathcal{O}}(\sqrt{T})$ constraint violation and a tight fraction of the optimal reward. Moreover, our algorithm is based on a policy optimization approach, which is much more efficient than occupancy-measure-based methods.

261A Contextual Online Learning Theory of Brokerage

[openreview] [pdf]

Abstract We study the role of contextual information in the online learning problem of brokerage between traders. At each round, two traders arrive with secret valuations about an asset they wish to trade. The broker suggests a trading price based on contextual data about the asset. Then, the traders decide to buy or sell depending on whether their valuations are higher or lower than the brokerage price. We assume the market value of traded assets is an unknown linear function of a $d$-dimensional vector representing the contextual information available to the broker. Additionally, at each time step, we model traders’ valuations as independent bounded zero-mean perturbations of the asset’s current market value, allowing for potentially different unknown distributions across traders and time steps. Consistently with the existing online learning literature, we evaluate the performance of a learning algorithm with the regret with respect to the gain from trade. If the noise distributions admit densities bounded by some constant $L$, then, for any time horizon $T$: if the agents’ valuations are revealed after each interaction, we provide an algorithm achieving $O(Ld \ln T)$ regret and show a corresponding matching lower bound of $\Omega(Ld \ln T)$; if only their willingness to sell or buy at the proposed price is revealed after each interaction, we provide an algorithm achieving $O(\sqrt{LdT \ln T})$ regret and show that this rate is optimal (up to logarithmic factors), via a lower bound of $\Omega(\sqrt{LdT})$. To complete the picture, we show that if the bounded density assumption is lifted, then the problem becomes unlearnable, even with full feedback.

262Diffusion Guided Adversarial State Perturbations in Reinforcement Learning

[openreview] [pdf]

Abstract Reinforcement learning (RL) systems, while achieving remarkable success across various domains, are vulnerable to adversarial attacks. This is especially a concern in vision-based environments where minor manipulations of high-dimensional image inputs can easily mislead the agent’s behavior. To this end, various defenses have been proposed recently, with state-of-the-art approaches achieving robust performance even under large state perturbations. Upon closer investigation, however, we found that the effectiveness of the current defenses is due to a fundamental weakness of the existing $l_p$-norm constrained attacks, which can barely alter the semantics of the input even under a relatively large perturbation budget. In this work, we propose SHIFT, a novel diffusion-based state perturbation attack to go beyond this limitation. Specifically, we train a history-conditioned diffusion model, enhanced with policy guidance and realism detection to generate perturbed states that are semantically different from the true states while remaining realistic and history-aligned to avoid detection. Evaluations show that our attack effectively breaks existing defenses, including the most sophisticated ones, and significantly lowers the agent’s cumulative reward in various Atari games by more than 50%. The results highlight the vulnerability of RL agents to semantics-aware adversarial perturbations, indicating the importance of developing more robust policies for safety-critical domains.

263Mitigating Embedding Collapse in Diffusion Models for Categorical Data

[openreview] [pdf]

Abstract Latent diffusion models have enabled continuous-state diffusion models to handle a variety of datasets, including categorical data. However, most methods rely on fixed pretrained embeddings, limiting the benefits of joint training with the diffusion model. While jointly learning the embedding (via reconstruction loss) and the latent diffusion model (via score matching loss) could enhance performance, our analysis shows that end-to-end training risks embedding collapse, degrading generation quality. To address this issue, we introduce CATDM, a continuous diffusion framework within the embedding space that stabilizes training. We propose a novel objective combining the joint embedding-diffusion variational lower bound with a Consistency-Matching (CM) regularizer, alongside a shifted cosine noise schedule and random dropping strategy. The CM regularizer ensures the recovery of the true data distribution. Experiments on benchmarks show that CATDM mitigates embedding collapse, yielding superior results on FFHQ, LSUN Churches, and LSUN Bedrooms. In particular, CATDM achieves an FID of 6.81 on ImageNet 256×256 with 50 steps. It outperforms non-autoregressive models in machine translation and is on a par with previous methods in text generation.

264Eliminating Oversaturation and Artifacts of High Guidance Scales in Diffusion Models

[openreview] [pdf]

Abstract Classifier-free guidance (CFG) is crucial for improving both generation quality and alignment between the input condition and final output in diffusion models. While a high guidance scale is generally required to enhance these aspects, it also causes oversaturation and unrealistic artifacts. In this paper, we revisit the CFG update rule and introduce modifications to address this issue. We first decompose the update term in CFG into parallel and orthogonal components with respect to the conditional model prediction and observe that the parallel component primarily causes oversaturation, while the orthogonal component enhances image quality. Accordingly, we propose down-weighting the parallel component to achieve high-quality generations without oversaturation. Additionally, we draw a connection between CFG and gradient ascent and introduce a new rescaling and momentum method for the CFG update rule based on this insight. Our approach, termed adaptive projected guidance (APG), retains the quality-boosting advantages of CFG while enabling the use of higher guidance scales without oversaturation. APG is easy to implement and introduces practically no additional computational overhead to the sampling process. Through extensive experiments, we demonstrate that APG is compatible with various conditional diffusion models and samplers, leading to improved FID, recall, and saturation scores while maintaining precision comparable to CFG, making our method a superior plug-and-play alternative to standard classifier-free guidance.
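
A compact sketch of the projection step described above: decompose the guidance update into components parallel and orthogonal to the conditional prediction and down-weight the parallel part. The factor `eta` is illustrative, and the paper's rescaling and momentum terms are omitted.

```python
import torch

def projected_guidance(cond, uncond, scale=7.5, eta=0.3):
    # Standard CFG would return cond + (scale - 1) * (cond - uncond).
    diff = (cond - uncond).flatten(1)
    ref = cond.flatten(1)
    coeff = (diff * ref).sum(1, keepdim=True) / ref.pow(2).sum(1, keepdim=True).clamp_min(1e-12)
    parallel = coeff * ref                  # component blamed for oversaturation
    orthogonal = diff - parallel            # component that enhances quality
    update = eta * parallel + orthogonal    # down-weight only the parallel part
    return cond + (scale - 1.0) * update.view_as(cond)
```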

266Leveraging Driver Field-of-View for Multimodal Ego-Trajectory Prediction

[openreview] [pdf]

Abstract Understanding drivers’ decision-making is crucial for road safety. Although predicting the ego-vehicle’s path is valuable for driver-assistance systems, existing methods mainly focus on external factors like other vehicles’ motions, often neglecting the driver’s attention and intent. To address this gap, we infer the ego-trajectory by integrating the driver’s attention and the surrounding scene. We introduce RouteFormer, a novel multimodal ego-trajectory prediction network combining GPS data, environmental context, and driver field-of-view—comprising first-person video and gaze fixations. We also present the Path Complexity Index (PCI), a new metric for trajectory complexity that enables a more nuanced evaluation of challenging scenarios. To tackle data scarcity and enhance diversity, we introduce GEM, a comprehensive dataset of urban driving scenarios enriched with synchronized driver field-of-view and gaze data. Extensive evaluations on GEM and DR(eye)VE demonstrate that RouteFormer significantly outperforms state-of-the-art methods, achieving notable improvements in prediction accuracy across diverse conditions. Ablation studies reveal that incorporating driver field-of-view data yields significantly better average displacement error, especially in challenging scenarios with high PCI scores, underscoring the importance of modeling driver attention. All data, code, and models will be made publicly available.

267Data Exfiltration in Diffusion Models: A Backdoor Attack Approach

[openreview] [pdf]

Abstract As diffusion models (DMs) become increasingly susceptible to adversarial attacks, this paper investigates a novel method of data exfiltration through strategically implanted backdoors. Unlike conventional techniques that directly alter data, we pioneer the use of unique trigger embeddings for each image to enable covert data retrieval. Furthermore, we extend our exploration to text-to-image diffusion models such as Stable Diffusion by introducing the Caption Backdoor Subnet (CBS), which exploits these models for both image and caption extraction. This innovative approach not only reveals an unexplored facet of diffusion model security but also contributes valuable insights toward enhancing the resilience of generative models against sophisticated threats.

268Policy Transfer via Latent Graph Planning

[openreview] [pdf]

Abstract We introduce a transfer learning framework for deep reinforcement learning that integrates graph-based planning with self-supervised representation learning to efficiently transfer knowledge across tasks. While standard reinforcement learning aims to learn policies capable of solving long-horizon tasks, the resulting policies often fail to generalize to novel tasks and environments. Our approach addresses this limitation by decomposing long-horizon tasks into sequences of transferable short-horizon tasks modeled by goal-conditioned policies. We utilize a planning graph to generate fine-grained sub-goals that guide these short-horizon policies to solve novel long-horizon tasks. Experimental results show that our method improves sample efficiency and demonstrates an improved ability to solve sparse-reward and long-horizon tasks compared to baseline methods in challenging single-agent and multi-agent scenarios. In particular, compared to the state-of-the-art, our method achieves the same or better expected policy reward while requiring fewer training samples when learning novel tasks.

269How to distill task-agnostic representations from many teachers?

[openreview] [pdf]

Abstract Casting complex inputs onto tractable representations is a critical step in many fields. Differences in architectures, loss functions, input modalities, and datasets lead to embedding models that capture diverse information of the input. Multi-teacher distillation seeks to exploit this diversity to create richer representations but often remains task-specific. We extend this framework by proposing a task-oriented setting that introduces an objective function based on the “majority vote” principle. We demonstrate that the mutual information between the student and the teachers is an upper bound for this function, providing a task-agnostic loss for our distillation procedure. An extensive evaluation is performed in different domains --- natural language processing, computer vision, and molecular modeling --- indicating that our method effectively leverages teacher diversity to produce more informative representations. Finally, we use our method to train and release new state-of-the-art embedders, enabling improved downstream performance in NLP and molecular modeling.

270Can In-context Learning Really Generalize to Out-of-distribution Tasks?

[openreview] [pdf]

Abstract In this work, we explore the mechanism of in-context learning (ICL) on out-of-distribution (OOD) tasks that were not encountered during training. To achieve this, we conduct synthetic experiments where the objective is to learn OOD mathematical functions through ICL using a GPT-2 model. We reveal that Transformers may struggle to learn OOD task functions through ICL. Specifically, ICL performance resembles implementing a function within the pretraining hypothesis space and optimizing it with gradient descent based on the in-context examples. Additionally, we investigate ICL’s well-documented ability to learn unseen abstract labels in context. We demonstrate that such ability only manifests in the scenarios without distributional shifts and, therefore, may not serve as evidence of new-task-learning ability. Furthermore, we assess ICL’s performance on OOD tasks when the model is pretrained on multiple tasks. Both empirical and theoretical analyses demonstrate the existence of the low-test-error preference of ICL, where it tends to implement the pretraining function that yields low test error in the testing context. We validate this through numerical experiments. This new theoretical result, combined with our empirical findings, elucidates the mechanism of ICL in addressing OOD tasks.

271Performance Control in Early Exiting to Deploy Large Models at the Same Cost of Smaller Ones

[openreview] [pdf]

Abstract Early Exiting (EE) is a promising technique for speeding up inference at the cost of limited performance loss. It adaptively allocates compute budget to data points based on their difficulty by exiting at earlier layers when predictions are confident. In this study, we first present a novel perspective on the EE approach, demonstrating that larger models, when deployed with EE, can achieve higher performance than smaller models while maintaining similar computational costs. As existing EE approaches rely on confidence estimation at each exit point, we further study the impact of overconfidence on the controllability of the compute/performance trade-off. We introduce Performance Control Early Exiting (PCEE), a method that enables accuracy thresholding by basing decisions not on a datapoint’s confidence but on the average accuracy of samples with similar confidence levels from a held-out validation set. In our experiments with MSDNets and Vision Transformer architectures on CIFAR-10, CIFAR-100, and ImageNet, we show that PCEE offers a simple yet computationally efficient approach that provides better control over performance than standard confidence-based approaches, and allows us to scale up model sizes to yield performance gains while reducing the computational cost.
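
A small sketch of the stated decision rule: bin held-out validation samples by confidence at a given exit, record accuracy per bin, and exit only when the bin accuracy (not the raw confidence) clears the target. The bin granularity is an illustrative choice.

```python
import numpy as np

def calibrate_bins(val_conf, val_correct, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(val_conf, edges) - 1, 0, n_bins - 1)
    acc = np.array([val_correct[idx == b].mean() if np.any(idx == b) else 0.0
                    for b in range(n_bins)])
    return edges, acc

def should_exit(confidence, edges, bin_acc, target_acc=0.9):
    b = min(max(np.digitize(confidence, edges) - 1, 0), len(bin_acc) - 1)
    # Exit only if similarly confident validation samples were accurate enough.
    return bin_acc[b] >= target_acc
```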

272Federated Unlearning with Diffusion Models

[openreview] [pdf]

Abstract In recent years, diffusion models have been widely adopted by individual users due to their outstanding performance in generation. During usage, individual users develop a need to forget privacy-related content, making the scenario of using diffusion models on the clients a natural federated unlearning setting. For this scenario, we propose FedDUL, a Federated UnLearning method with Diffusion models, which addresses the unlearn requests from clients using the diffusion models. On one hand, we utilize local data on the clients to perform attention-based unlearning, enabling the local diffusion model to forget the concepts specified by the clients. On the other hand, we filter and group the unlearn requests from clients, gradually aggregating reasonable requests into the global diffusion model on the server, thereby protecting client privacy within the global model. The theoretical analysis further demonstrates the inherent unity between the federated unlearning problem based on diffusion models and federated learning, and extends this unity to traditional federated unlearning methods. Extensive quantitative and visualization experiments are conducted to evaluate the unlearning of both local and global models and to discuss the communication and computation costs of our method. The results demonstrate that our method can satisfy the unlearn requests of multiple clients without compromising the generative capabilities for irrelevant concepts, providing new ideas and methods for the application of diffusion models in federated unlearning.

273Strategic Exploration for Inverse Constraint Inference with Efficiency Guarantee

[openreview] [pdf]

Abstract In many realistic applications, the constraints are not readily available, and we need to infer the constraints respected by the expert agents from their behaviors. The problem is known as Inverse Constraint Inference (ICI). A common solver, Inverse Constrained Reinforcement Learning (ICRL), seeks to recover the optimal constraints in complex environments in a data-driven manner. Existing ICRL algorithms collect training samples from an interactive environment. However, the efficacy and efficiency of these sampling strategies remain unknown. To bridge this gap, we introduce a strategic exploration framework with guaranteed efficiency. Specifically, we define a feasible constraint set for ICRL problems and investigate how expert policy and environmental dynamics influence the optimality of constraints. Motivated by our findings, we propose two exploratory algorithms to achieve efficient constraint inference via 1) dynamically reducing the bounded aggregate error of cost estimation and 2) strategically constraining the exploration policy. Both algorithms are theoretically grounded with tractable sample complexity. We empirically demonstrate the performance of our algorithms under various environments.

274Supervised and Semi-Supervised Diffusion Maps with Label-Driven Diffusion

[openreview] [pdf]

Abstract In this paper, we introduce Supervised Diffusion Maps (SDM) and Semi-Supervised Diffusion Maps (SSDM), which transform the well-known unsupervised dimensionality reduction algorithm, Diffusion Maps, into supervised and semi-supervised learning tools. The proposed methods, SDM and SSDM, are based on our new approach that treats the labels as a second view of the data. This unique framework allows us to incorporate ideas from multi-view learning. Specifically, we propose constructing two affinity kernels corresponding to the data and the labels. We then propose a multiplicative interpolation scheme of the two kernels, whose purpose is twofold. First, our scheme extracts the common structure underlying the data and the labels by defining a diffusion process driven by the data and the labels. This label-driven diffusion produces an embedding that emphasizes the properties relevant to the label-related task. Second, the proposed interpolation scheme balances the influence of the two kernels. We show on multiple benchmark datasets that the embedding learned by SDM and SSDM is more effective in downstream regression and classification tasks than existing unsupervised, supervised, and semi-supervised nonlinear dimension reduction methods.
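
A minimal sketch of the two-kernel construction, under the assumption that the multiplicative interpolation is an elementwise geometric mixture of a data kernel and a label kernel; the actual interpolation scheme in SDM/SSDM may differ.

```python
import numpy as np

def rbf_kernel(z, sigma=1.0):
    sq = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma**2))

def label_driven_embedding(x, y, beta=0.5, k=2):
    # x: (n, d) features; y: (n, 1) labels treated as a second view of the data.
    K = rbf_kernel(x) ** (1.0 - beta) * rbf_kernel(y) ** beta  # assumed mixture
    P = K / K.sum(axis=1, keepdims=True)        # row-normalized diffusion operator
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)[1:k + 1]     # drop the trivial top eigenvector
    return (vecs[:, order] * vals[order]).real
```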

275Learning to Permute with Discrete Diffusion

[openreview] [pdf]

Abstract The group of permutations $S_n$, also known as the finite symmetric group, is essential in fields such as combinatorics, physics, and chemistry. However, learning a probability distribution over $S_n$ poses significant challenges due to its intractable size and discrete nature. In this paper, we introduce SymmetricDiffusers, a novel discrete diffusion model that simplifies the task of learning a complicated distribution over $S_n$ by decomposing it into learning simpler transitions of the reverse diffusion using deep neural networks. We identify the riffle shuffle as an effective forward transition and provide empirical guidelines for selecting the diffusion length based on the theory of random walks on finite groups. Additionally, we propose a generalized Plackett-Luce (PL) distribution for the reverse transition, which is provably more expressive than the PL distribution. We further introduce a theoretically grounded “denoising schedule” to improve sampling and learning efficiency. Extensive experiments show that our model achieves state-of-the-art or comparable performances on solving tasks including sorting 4-digit MNIST images, jigsaw puzzles, and traveling salesman problems.
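
For concreteness, here is the classical Gilbert-Shannon-Reeds riffle shuffle that the abstract identifies as the forward transition: cut the deck at a Binomial(n, 1/2) position, then interleave by dropping from each packet with probability proportional to its remaining size.

```python
import numpy as np

def riffle_shuffle(perm, rng):
    n = len(perm)
    cut = rng.binomial(n, 0.5)                     # binomial cut point
    left, right = list(perm[:cut]), list(perm[cut:])
    out = []
    while left or right:
        # Drop from a packet with probability proportional to its size.
        if rng.random() < len(left) / (len(left) + len(right)):
            out.append(left.pop(0))
        else:
            out.append(right.pop(0))
    return np.array(out)

rng = np.random.default_rng(0)
print(riffle_shuffle(np.arange(8), rng))   # one forward diffusion step on S_8
```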

276PROGRESSIVE KNOWLEDGE DISTILLATION (PKD): A MODULAR APPROACH FOR ARCHITECTURE-AGNOSTIC KNOWLEDGE DISTILLATION

[openreview] [pdf]

Abstract Knowledge distillation (KD) is a key technique for training lightweight deep neural networks, particularly in resource-constrained environments. While existing KD methods utilize intermediate features to improve student models, they often overlook the proper alignment between teacher-student layers and fail to select the most informative data for training each student layer. These limitations are especially pronounced in architecture-agnostic scenarios, where different network architectures complicate knowledge transfer. We propose PKD, a Progressive Knowledge Distillation framework that progressively aligns teacher and student layers through feature-based modularization. Each student module is trained using the most representative features from its corresponding teacher module, starting with the shallowest layers and progressively moving to deeper ones. This training method enables efficient, architecture-agnostic knowledge transfer across a variety of model architectures. Experiments on CIFAR-100 and ImageNet-1K demonstrate that PKD outperforms baseline models, achieving performance improvements of up to 4.54% and 6.46%, respectively, thereby validating its effectiveness in diverse neural network settings.

277What should an AI assessor optimise for?

[openreview] [pdf]

Abstract An AI assessor is an external, ideally independent system that predicts an indicator, e.g., a loss value, of another AI system. Assessors can leverage information from the test results of many other AI systems and have the flexibility of being trained on any loss function: from squared error to toxicity metrics. Here we address the question: is it always optimal to train the assessor for the target loss? Or could it be better to train for a different loss and then map predictions back to the target loss? Using ten regression problems with tabular data, we experimentally explore this question for regression losses with monotonic and nonmonotonic mappings and find that, contrary to intuition, optimising for more informative losses is not generally better. Surprisingly though, some monotonic transformations, such as the logistic loss used to minimise the absolute or squared error, are promising.

278XXLTraffic: Expanding and Extremely Long Traffic forecasting beyond test adaptation

[openreview] [pdf]

Abstract Traffic forecasting is crucial for smart cities and intelligent transportation initiatives, where deep learning has made significant progress in modeling complex spatio-temporal patterns in recent years. However, current public datasets have limitations in reflecting the distribution shift nature of real-world scenarios, characterized by continuously evolving infrastructures, varying temporal distributions, and long temporal gaps due to sensor downtimes or changes in traffic patterns. These limitations inevitably restrict the practical applicability of existing traffic forecasting datasets. To bridge this gap, we present XXLTraffic, the public traffic dataset with the longest available timespan, collected from Los Angeles, USA, and New South Wales, Australia, and curated to support research in extremely long forecasting beyond test adaptation. Our benchmark includes both typical time-series forecasting settings with hourly and daily aggregated data and novel configurations that introduce gaps and down-sample the training size to better simulate practical constraints. We anticipate the new XXLTraffic will provide a fresh perspective for the time-series and traffic forecasting communities. It would also offer a robust platform for developing and evaluating models designed to tackle the extremely long forecasting problems beyond test adaptation. Our dataset supplements existing spatio-temporal data resources and leads to new research directions in this domain.

279Training Task Experts through Retrieval Based Distillation

[openreview] [pdf]

Abstract One of the most reliable ways to create deployable models for specialized tasks is to obtain an adequate amount of high-quality task-specific data. However, for specialized tasks, often such datasets do not exist. Existing methods address this by creating such data from large language models (LLMs) and then distilling such knowledge into smaller models. However, these methods are limited by the quality of the LLMs’ output and tend to generate repetitive or incorrect data. In this work, we present Retrieval Based Distillation (ReBase), a method that first retrieves data from rich online sources and then transforms them into domain-specific data. This method greatly enhances data diversity. Moreover, ReBase generates Chain-of-Thought reasoning and distills the reasoning capacity of LLMs. We test our method on 4 benchmarks and show that it significantly improves performance by up to 10.76% on SQuAD, 1.37% on MNLI, and 1.94% on BBH.

280Improving real-world sequence design with a simple meta-heuristic for detecting distribution shift

[openreview] [pdf]

Abstract Biological sequence design is one of the most impactful areas where model-based optimization is applied. A common scenario involves using a fixed training set to train predictive models, with the goal of designing new sequences that outperform those present in the training data. This by definition results in a distribution shift, where the model is applied to samples that are substantially different from those in the training set (or otherwise they wouldn’t have a chance of being much better). While most MBO methods offer some balancing heuristic to control for false positives, finding the right balance of pushing the design distribution while maintaining model accuracy requires deep knowledge of the algorithm and artful application, limiting successful adoption by practitioners. To tackle this issue, we propose a straightforward meta-algorithm for design practitioners that detects distribution shifts when using any MBO method. Through a real-world sequence design experiment, we show that (1) real-world distribution shift is far more severe than observed in the simulated settings where most MBO algorithms are benchmarked, and (2) our approach successfully reduces the adverse effects of distribution shift. We believe this method can significantly improve design quality for sequence design tasks and potentially other domain applications where offline optimization faces harsh distribution shifts.

281Algorithm for Concept Extrapolation: Diverse Generalization via Selective Disagreement

[openreview] [pdf]

Abstract Standard deep learning approaches often struggle to handle out-of-distribution data, especially when the distributional shift breaks spurious correlations. While some approaches to handling spurious correlations under distributional shift aim to separate causal and spurious features without access to target distribution data, they rely on labeled data from different domains or contingent assumptions about the nature of neural representations. Existing methods that do make use of unlabeled target data make strict assumptions about the target data distribution. To overcome these limitations, we present the Algorithm for Concept Extrapolation (ACE). Using an exponentially-weighted disagreement loss to maximize disagreement on target instances that break spurious correlations, ACE achieves state-of-the-art performance on spurious complete correlation benchmarks. We also show ACE is robust to unlabeled target distributions where spurious and ground truth features are not statistically independent. Finally, we demonstrate the applicability of ACE for handling goal misgeneralization in deep reinforcement learning, with our “ACE agent” achieving a 16% higher level completion rate in the CoinRun goal misgeneralization problem when the coin is randomly placed in the level.

282Learning-Augmented Robust Algorithmic Recourse

[openreview] [pdf]

Abstract The widespread use of machine learning models in high-stakes domains can have a major negative impact, especially on individuals who receive undesirable outcomes. Algorithmic recourse provides such individuals with suggestions of minimum-cost improvements they can make to achieve a desirable outcome in the future. However, machine learning models often get updated over time and this can cause a recourse to become invalid (i.e., not lead to the desirable outcome). The robust recourse literature aims to choose recourses that remain valid even under adversarial model changes, but this robustness comes at a higher cost. To overcome this obstacle, we initiate the study of algorithmic recourse through the learning-augmented framework and evaluate the extent to which a designer equipped with a prediction regarding future model changes can reduce the cost of recourse when the prediction is accurate (consistency) while also limiting the cost even when the prediction is inaccurate (robustness). We propose a novel algorithm for this problem, study the robustness-consistency trade-off, and analyze how prediction accuracy affects performance.

283Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation Models

[openreview] [pdf]

Abstract Go-Explore is a powerful family of algorithms designed to solve hard-exploration problems built on the principle of archiving discovered states, and iteratively returning to and exploring from the most promising states. This approach has led to superhuman performance across a wide variety of challenging problems including Atari games and robotic control, but requires manually designing heuristics to guide exploration (i.e. determine which states to save and explore from, and what actions to consider next), which is time-consuming and infeasible in general. To resolve this, we propose Intelligent Go-Explore (IGE) which greatly extends the scope of the original Go-Explore by replacing these handcrafted heuristics with the intelligence and internalized human notions of interestingness captured by giant pretrained foundation models (FMs). This provides IGE with a human-like ability to instinctively identify how interesting or promising any new state is (e.g. discovering new objects, locations, or behaviors), even in complex environments where heuristics are hard to define. Moreover, IGE offers the exciting and previously impossible opportunity to recognize and capitalize on serendipitous discoveries that cannot be predicted ahead of time. We evaluate our algorithm on a diverse range of language and vision-based tasks that require search and exploration. Across these tasks, IGE strongly exceeds classic reinforcement learning and graph search baselines, and also succeeds where prior state-of-the-art FM agents like Reflexion completely fail. Overall, Intelligent Go-Explore combines the tremendous strengths of FMs and the powerful Go-Explore algorithm, opening up a new frontier of research into creating more generally capable agents with impressive exploration capabilities.

284Instant Policy: In-Context Imitation Learning via Graph Diffusion

[openreview] [pdf]

Abstract Following the impressive capabilities of in-context learning with large transformers, In-Context Imitation Learning (ICIL) is a promising opportunity for robotics. We introduce Instant Policy, which learns new tasks instantly from just one or two demonstrations, achieving ICIL through two key components. First, we introduce inductive biases through a graph representation and model ICIL as a graph generation problem using a learned diffusion process, enabling structured reasoning over demonstrations, observations, and actions. Second, we show that such a model can be trained using pseudo-demonstrations – arbitrary trajectories generated in simulation – as a virtually infinite pool of training data. Our experiments, in both simulation and reality, show that Instant Policy enables rapid learning of various everyday robot tasks. We also show how it can serve as a foundation for cross-embodiment and zero-shot transfer to language-defined tasks.

285Dataset Condensation with Sharpness-Aware Trajectory Matching

[openreview] [pdf]

Abstract Dataset condensation aims to synthesise datasets with a few representative samples that can effectively represent the original datasets. This enables efficient training and produces models with performance close to those trained on the original sets. Most existing dataset condensation methods conduct dataset learning under bilevel (inner- and outer-loop) optimisation. However, due to its notoriously complicated loss landscape and expensive time-space complexity, the preceding methods either develop advanced training protocols so that the learned datasets generalise to unseen tasks or reduce the inner-loop learning cost, which increases proportionally with the number of unrolling steps. This problem worsens when the datasets are learned by matching the trajectories of networks trained on the real and synthetic datasets with a long-horizon inner loop. To address these issues, we introduce Sharpness-Aware Trajectory Matching (SATM), which enhances the generalisation capability of learned synthetic datasets by minimising sharpness in the outer loop of the bilevel optimisation. Moreover, our approach is coupled with an efficient hypergradient approximation that is mathematically well-supported and straightforward to implement, with controllable computational overhead. Empirical evaluations of SATM demonstrate its effectiveness across various applications, including standard in-domain benchmarks and out-of-domain settings. Moreover, its easy-to-implement properties afford flexibility, allowing it to integrate with other advanced sharpness-aware minimisers. We will release our code on GitHub.

286Decoupled Offline to Online finetuning via Dynamics Model

[openreview] [pdf]

Abstract Constrained by the sub-optimal datasets used in offline reinforcement learning (RL), an offline-trained agent should be finetuned online before deployment. Due to conservative offline algorithms and the unbalanced state distribution of the offline dataset, offline-to-online finetuning faces severe distribution shift. This shift disturbs policy improvement during online interaction and can even cause a performance drop. A natural yet unexplored idea is whether policy improvement can be decoupled from distribution shift. In this work, we propose a decoupled offline-to-online finetuning framework using the dynamics model from model-based methods. During online interaction, only the dynamics model is finetuned to overcome the distribution shift. The policy is then finetuned in an offline manner with the finetuned dynamics and without further interaction. As a result, the online stage only needs to deal with a simpler supervised dynamics-learning problem, rather than complex policy improvement under the interference of distribution shift. When finetuning the policy, we adopt the offline approach, which ensures the conservatism of the algorithm and fundamentally avoids sudden performance crashes. We conduct extensive evaluation on classical offline RL datasets, demonstrating the effective elimination of distribution shift, stable and superior policy finetuning performance, and exceptional interaction efficiency within our decoupled offline-to-online finetuning framework.

287Simple Policy Optimization

[openreview] [pdf]

Abstract Model-free reinforcement learning algorithms have seen remarkable progress, but key challenges remain. Trust Region Policy Optimization (TRPO) is known for ensuring monotonic policy improvement through conservative updates within a trust region, backed by strong theoretical guarantees. However, its reliance on complex second-order optimization limits its practical efficiency. Proximal Policy Optimization (PPO) addresses this by simplifying TRPO’s approach using ratio clipping, improving efficiency but sacrificing some theoretical robustness. This raises a natural question: Can we combine the strengths of both methods? In this paper, we introduce Simple Policy Optimization (SPO), a novel unconstrained first-order algorithm. SPO integrates the surrogate objective with Total Variation (TV) divergence instead of Kullback-Leibler (KL) divergence, achieving a balance between the theoretical rigor of TRPO and the efficiency of PPO. Our new objective improves upon ratio clipping, offering stronger theoretical properties and better constraining the probability ratio within the trust region. Empirical results demonstrate that SPO achieves state-of-the-art performance with a simple implementation and improved sample efficiency, particularly for training large, complex network architectures end-to-end.
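
As a rough illustration of how a TV-based objective differs from ratio clipping, the sketch below adds a total-variation penalty to a PPO-style surrogate for discrete actions. The penalty form, the trust-region radius `tv_radius`, the coefficient `lam`, and the function name are illustrative assumptions, not SPO's actual objective.

```python
import torch

def spo_style_loss(logp_new, logp_old, adv, probs_new, probs_old,
                   tv_radius=0.05, lam=10.0):
    """PPO-like surrogate that constrains TV divergence instead of
    clipping the probability ratio (illustrative sketch only)."""
    ratio = torch.exp(logp_new - logp_old)   # pi_new(a|s) / pi_old(a|s)
    surrogate = (ratio * adv).mean()         # standard policy-gradient surrogate
    # Empirical total variation between the discrete action distributions.
    tv = 0.5 * (probs_new - probs_old).abs().sum(dim=-1).mean()
    # Penalize only the part of the TV divergence beyond the trust region.
    return -(surrogate - lam * torch.clamp(tv - tv_radius, min=0.0))
```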

288Stabilize continual learning with hyperspherical replay

[openreview] [pdf]

Abstract Neural networks face catastrophic forgetting of previously learned knowledge when training on new task data. While the field of continual learning has made promising progress in reducing this forgetting, recent work has uncovered an interesting phenomenon: existing techniques often exhibit a sharp performance drop on prior tasks during the initial stages of new task training, a phenomenon known as the “stability gap.” This phenomenon not only raises safety concerns but also challenges the current understanding of neural network behavior in continual learning scenarios. Inspired by this discovery, we revisit two fundamental questions in continual learning: 1) Is past learned knowledge within deep networks lost abruptly or gradually? and 2) Is past learned knowledge ever completely erased? Our analysis reveals that abrupt forgetting occurs not only in the final fully connected layer but also permeates the feature space and most layers, sparing only the earliest layers. Alarmingly, a single gradient update can severely disrupt the learned class structure. We identify degenerate solutions in the softmax cross-entropy loss as a major contributing factor, with memory samples exhibiting higher feature norms compared to new samples. To address these issues, we propose Adaptive Angular Replay (AAR), a simple yet effective approach that learns features in hyperspherical space using feature and weight normalization. AAR demonstrates a strong ability to preserve class structure during task transitions. Additionally, we introduce an adaptive scaling strategy to further mitigate the stability gap and improve overall accuracy.
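
A minimal sketch of the feature- and weight-normalization idea described above: logits become scaled cosine similarities, so features and class prototypes live on the unit hypersphere and the degenerate high-feature-norm solutions are ruled out. The fixed `scale` value stands in for the paper's adaptive scaling strategy and is an assumption.

```python
import torch
import torch.nn.functional as F

class HypersphericalClassifier(torch.nn.Module):
    """Logits as scaled cosine similarity between normalized features
    and normalized class weights."""
    def __init__(self, feat_dim, num_classes, scale=16.0):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale = scale  # fixed here; an adaptive rule would replace it

    def forward(self, feats):
        feats = F.normalize(feats, dim=-1)    # features on the unit sphere
        w = F.normalize(self.weight, dim=-1)  # prototypes on the unit sphere
        # Bounded feature norms remove the incentive to inflate logits
        # by growing the norm of memory-sample features.
        return self.scale * feats @ w.t()
```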

289Distributed Constrained Optimal Consensus Under a Directed Graph

[openreview] [pdf]

Abstract In this paper, the distributed constrained optimal consensus problem of multi-agent systems under a directed graph is investigated. We propose two projection-based distributed constrained optimal consensus algorithms: one addressing set constraints and the other tailored for general constraints. Only the relative state is exchanged among agents in these two algorithms. In the stability analysis of the case with set constraints, we transform the distributed optimization problem into a constrained leaderless consensus problem by adopting a sliding mode approach. Building on this foundational transformation, we further develop a projection-based distributed constrained optimal consensus algorithm to address general constraints. It is shown that the proposed algorithm achieves an ergodic convergence rate of O(1/k) with respect to the first-order optimality residuals. Numerical simulations are conducted to validate the effectiveness of our theoretical results.

290How new data pollutes LLM knowledge and how to dilute it

[openreview] [pdf]

Abstract Understanding how learning new texts alters the existing knowledge in a large language model is of great importance, because it is through such accumulated changes that the LLM was initially pre-trained, and it is also through such changes that continual, new learning in LLMs can proceed. As a result, both desirable alterations (i.e. generalization) and undesirable alterations (i.e. hallucination) can occur. Here, we study the learning of new texts, one at a time, and ask: how does it impact the underlying LLM knowledge? We show that learning new texts induces ‘priming’, an undesirable effect that pollutes existing knowledge where it should not. Centrally, we demonstrate that we can predict how much priming will happen after learning, using token probability before learning. This relation was empirically robust across models of various sizes (PALM-2-xs/s, Gemma-2b, Llama-2-7b) and training stages. To show this, we created a new dataset, called “Outlandish”, consisting of 1320 samples with diverse textual characteristics. Finally, we propose two strategies to mitigate the spread of priming: first, a simple text augmentation technique which we call the “stepping-stone”, and second, a novel update pruning technique (“ignore-k”). These decrease priming by a median of 50%-75% and 50%-95% respectively, depending on the model architecture, and enhance the specificity of new learning in language models. The dataset and reproducible findings can be found [LINK omitted for double blind review].
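
A sketch of the kind of pre-learning measurement the prediction relies on: the probability a model assigns to a keyword before training on the new text. The model choice (`gpt2`) and the single-token assumption for the keyword are illustrative stand-ins; the paper works with PALM-2, Gemma, and Llama models.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def keyword_probability(model_name: str, prompt: str, keyword: str) -> float:
    """Probability assigned to `keyword` as the next token after `prompt`."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(prompt, return_tensors="pt").input_ids
    # Assumes the keyword maps to a single token; longer keywords would
    # need a sum of log-probs over their tokens.
    kw_id = tok(keyword, add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        next_token_logits = model(ids).logits[0, -1]
    return torch.softmax(next_token_logits, dim=-1)[kw_id].item()

# Low pre-learning keyword probability is the signal that predicts priming.
print(keyword_probability("gpt2", "The color of the sky is", " vermilion"))
```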

291Magnetic Mirror Descent Self-play Preference Optimization

[openreview] [pdf]

Abstract Standard Reinforcement Learning from Human Feedback (RLHF) methods mainly optimize preferences through the Bradley-Terry (BT) reward model, which may misalign with natural human preferences due to the strong transitivity assumption. Recent work has reframed the preference learning problem as a two-player constant-sum game, aiming to learn policies that better reflect human preferences by finding the Nash equilibrium (NE) of this game. However, existing methods under this framework either guarantee only average-iterate convergence or rely on strong first-order approximation assumptions. In this paper, we propose Mirror Descent Self-play Preference Optimization (MDSPO), a novel approach based on Magnetic Mirror Descent (MMD). By introducing an additional magnetic term, MDSPO achieves a linear convergence rate to the NE of the regularized game. Furthermore, we establish theoretical guarantees for the convergence of our algorithm to the NE of the original game by periodically updating the reference policy. This approach effectively guarantees that the final policy accurately reflects the true human preferences. To ensure our algorithm is both theoretically sound and practically viable, we provide a simple yet effective implementation that adapts the theoretical insights to the RLHF setting. We demonstrate its effectiveness on a variety of benchmarks.

292Towards Robust Concept Erasure in Diffusion Models: Unlearning Identity, Nudity and Artistic Styles

[openreview] [pdf]

Abstract Diffusion models have achieved remarkable success in generative tasks across various domains. However, the increasing demand for content moderation and the removal of specific concepts from these models has introduced the challenge of unlearning. In this work, we present a suite of robust methodologies that significantly enhance the unlearning process by employing advanced loss functions within knowledge distillation frameworks. Specifically, we utilize the Cramer-Wold distance and Jensen-Shannon (JS) divergence to facilitate more efficient and versatile concept removal. Although current unlearning techniques are effective in certain scenarios, they are typically limited to specific categories such as identity, nudity, or artistic style. In contrast, our proposed methods demonstrate robust versatility, seamlessly adapting to and performing effectively across a wide range of concept erasure categories. Our approach outperforms existing techniques, achieving consistent results across different unlearning categories and showcasing its broad applicability. Through extensive experiments, we show that our method not only surpasses previous benchmarks but also addresses key limitations of current unlearning techniques, paving the way for more responsible use of text-to-image diffusion models.

293Continuous Diffusion for Mixed-Type Tabular Data

[openreview] [pdf]

Abstract Score-based generative models (or diffusion models for short) have proven successful for generating text and image data. However, the adaptation of this model family to mixed-type tabular data has fallen short so far. In this paper, we propose CDTD, a Continuous Diffusion model for mixed-type Tabular Data. Specifically, we combine score matching and score interpolation to ensure a common continuous noise distribution for both continuous and categorical features alike. We counteract the high heterogeneity inherent to mixed-type data with distinct, adaptive noise schedules per feature or per data type. The learnable noise schedules ensure optimally allocated model capacity and balanced generative capability. We homogenize the data types further with model-specific loss calibration and initialization schemes tailored to mixed-type tabular data. Our experimental results show that CDTD consistently outperforms state-of-the-art benchmark models, captures feature correlations exceptionally well, and that heterogeneity in the noise schedule design boosts sample quality.

294High-dimensional Analysis of Knowledge Distillation: Weak-to-Strong Generalization and Scaling Laws

[openreview] [pdf]

Abstract A growing number of machine learning scenarios rely on knowledge distillation where one uses the output of a surrogate model as labels to supervise the training of a target model. In this work, we provide a sharp characterization of this process for ridgeless, high-dimensional regression, under two settings: (i) model shift, where the surrogate model is arbitrary, and (ii) distribution shift, where the surrogate model is the solution of empirical risk minimization with out-of-distribution data. In both cases, we characterize the precise risk of the target model through non-asymptotic bounds in terms of sample size and data distribution under mild conditions. As a consequence, we identify the form of the optimal surrogate model, which reveals the benefits and limitations of discarding weak features in a data-dependent fashion. In the context of weak-to-strong (W2S) generalization, this has the interpretation that (i) W2S training, with the surrogate as the weak model, can provably outperform training with strong labels under the same data budget, but (ii) it is unable to improve the data scaling law. We validate our results with numerical experiments on both ridgeless regression and neural network architectures.
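
A toy version of the distillation pipeline studied, under assumed dimensions and noise levels: a weak surrogate is fit by min-norm (ridgeless) regression on part of the data, its predictions serve as labels for the target model, and both are compared against the ground truth.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)

# Weak surrogate: min-norm (ridgeless) fit on a subset of the labeled data.
w_weak = np.linalg.pinv(X[:100]) @ y[:100]
y_surrogate = X @ w_weak                 # surrogate labels on the full set

# Target model: ridgeless regression supervised by the surrogate's labels.
w_target = np.linalg.pinv(X) @ y_surrogate

print("weak   error:", np.linalg.norm(w_weak - w_true))
print("target error:", np.linalg.norm(w_target - w_true))
```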

295TELEPORTATION WITH NULL SPACE GRADIENT PROJECTION FOR OPTIMIZATION ACCELERATION

[openreview] [pdf]

Abstract Optimization techniques have become increasingly critical due to the ever-growing model complexity and data scale. In particular, teleportation has emerged as a promising approach, which accelerates convergence of gradient descent-based methods by navigating within the loss invariant level set to identify parameters with advantageous geometric properties. Existing teleportation algorithms have primarily demonstrated their effectiveness in optimizing Multi-Layer Perceptrons (MLPs), but their extension to more advanced architectures, such as Convolutional Neural Networks (CNNs) and Transformers, remains challenging. Moreover, they often impose significant computational demands, limiting their applicability to complex architectures. To this end, we introduce an algorithm that projects the gradient of the teleportation objective function onto the input null space, effectively preserving the teleportation within the loss invariant level set and reducing computational cost. Our approach is readily generalizable from MLPs to CNNs, transformers, and potentially other advanced architectures. We validate the effectiveness of our algorithm across various benchmark datasets and optimizers, demonstrating its broad applicability.
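
The core projection can be illustrated for a single linear layer: an update confined to the null space of the layer's inputs leaves all outputs on the current batch, and hence the loss, unchanged, so the step stays on the loss-invariant level set. This sketch assumes inputs stacked as columns of `X`; the paper extends the idea to CNNs and transformers.

```python
import numpy as np

def project_to_input_null_space(grad_W, X, rtol=1e-10):
    """Project a weight gradient onto the null space of the layer inputs.

    For a linear layer y = W @ x, any update dW with dW @ X == 0 leaves
    the outputs on batch X unchanged, so the loss is preserved.
    """
    # Orthonormal basis of the column space of X (inputs as columns).
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    r = int((S > rtol * S.max()).sum())
    Ur = U[:, :r]
    # Remove the component of each gradient row lying in span(X).
    return grad_W - (grad_W @ Ur) @ Ur.T
```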

296softmax is not enough (for sharp out-of-distribution)

[openreview] [pdf]

Abstract A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function, with its capability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from “circuits” which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental limitation of the softmax function to robustly approximate sharp functions, prove this phenomenon theoretically, and propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.
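
The effect is easy to see numerically: lowering the softmax temperature at inference time concentrates the distribution on the maximum entry. How to choose the temperature adaptively is the paper's contribution and is not reproduced in this sketch; the values below are arbitrary.

```python
import numpy as np

def tempered_softmax(logits, temperature):
    """Softmax with an inference-time temperature; T < 1 sharpens."""
    z = logits / temperature
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.5, 0.3, 0.1])
print(tempered_softmax(logits, 1.0))   # dispersed over all items
print(tempered_softmax(logits, 0.25))  # mass concentrates on the argmax
```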

297Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

[openreview] [pdf]

Abstract For Mixture-of-Experts (MoE) models, an unbalanced expert load will lead to routing collapse or increased computational overhead. Existing methods commonly employ an auxiliary loss to encourage load balance, but a large auxiliary loss will introduce non-negligible interference gradients into training and thus impair the model performance. In order to control load balance while not producing undesired gradients during training, we propose Loss-Free Balancing, a new load balancing strategy that operates without auxiliary losses. To be specific, before the top-K routing decision, Loss-Free Balancing will first apply an expert-wise bias to the routing scores of each expert. By dynamically updating the bias of each expert according to its recent load, Loss-Free Balancing can consistently maintain a balanced distribution of expert load. In addition, since Loss-Free Balancing does not produce any interference gradients, it also elevates the upper bound of model performance gained from MoE training. We validate the performance of Loss-Free Balancing on MoE models with up to 3B parameters trained on up to 200B tokens. Experimental results show that Loss-Free Balancing achieves both better performance and better load balance compared with traditional auxiliary-loss-controlled load balancing strategies.
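
A minimal sketch of the described routing step: an expert-wise bias is added to the scores only for the top-K decision, then nudged against each expert's recent load. The update rate and the exact sign-based rule here are assumptions based on the description above.

```python
import torch

def loss_free_routing_step(scores, bias, top_k, update_rate=1e-3):
    """One biased top-K routing decision plus a load-based bias update.

    scores: [num_tokens, num_experts] router affinities.
    bias:   [num_experts] expert-wise bias used only for routing,
            so it injects no interference gradient into training.
    """
    topk_idx = (scores + bias).topk(top_k, dim=-1).indices

    # Count how many tokens each expert received in this batch.
    load = torch.zeros_like(bias)
    load.scatter_add_(0, topk_idx.reshape(-1),
                      torch.ones_like(topk_idx, dtype=bias.dtype).reshape(-1))

    # Lower the bias of overloaded experts, raise it for underloaded ones.
    bias = bias - update_rate * torch.sign(load - load.mean())
    return topk_idx, bias
```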

298Data-Centric Graph Condensation via Diffusion Trajectory Matching

[openreview] [pdf]

Abstract This paper introduces Data-Centric Graph Condensation (named DCGC), a data-centric and model-agnostic method for condensing a large graph into a smaller one by matching the distribution between the two graphs. DCGC defines the distribution of a graph as the trajectories of its node signals (such as node features and node labels) induced by a diffusion process over the geometric structure, which accommodates multi-order structural information. Built upon this, DCGC compresses the topological knowledge of the original graph into the orders-of-magnitude smaller synthetic one by aligning their distributions in input space. Compared with existing methods that stick to particular GNN architectures and require solving complicated optimization, DCGC can be flexibly applied to arbitrary off-the-shelf GNNs and achieves graph condensation with much faster speed. Apart from its cross-architecture generalization ability and training efficiency, experiments demonstrate that DCGC yields consistently superior performance over existing methods on datasets with varying scales and condensation ratios.

299LATABLE: TOWARDS LARGE TABULAR MODELS

[openreview] [pdf]

Abstract Tabular data is one of the most ubiquitous data modalities, yet the literature on tabular generative foundation models is lagging behind its text and vision counterparts. Large Tabular Models (LTMs) could revolutionize the way tabular data is used: not as any single dataset analyzed in a vacuum, but contextualized using its metadata and with respect to related datasets. Creating an LTM is difficult due to the heterogeneous feature spaces of different tabular datasets, metadata, and prior knowledge. In this work, we propose LaTable: a novel tabular diffusion model that addresses these challenges. We show LaTable can be trained across tabular datasets. Through extensive experiments, we find that LaTable displays early signs of scaling laws previously encountered in foundation model regimes. Moreover, LaTable outperforms baselines in out-of-distribution few-shot data generation.

300Risk-Sensitive Diffusion: Robustly Optimizing Diffusion Models with Noisy Samples

[openreview] [pdf]

Abstract Diffusion models are mainly studied on image data. However, non-image data (e.g., tabular data) are also prevalent in real applications and tend to be noisy due to some inevitable factors in the stage of data collection, degrading the generation quality of diffusion models. In this paper, we consider a novel problem setting where every collected sample is paired with a vector indicating the data quality: a risk vector. This setting applies to many scenarios involving noisy data, and we propose risk-sensitive SDE, a type of stochastic differential equation (SDE) parameterized by the risk vector, to address it. With proper coefficients, the risk-sensitive SDE can minimize the negative effect of noisy samples on the optimization of diffusion models. We conduct systematic studies for both Gaussian and non-Gaussian noise distributions, providing analytical forms of the risk-sensitive SDE. To verify the effectiveness of our method, we have conducted extensive experiments on multiple tabular and time-series datasets, showing that the risk-sensitive SDE permits a robust optimization of diffusion models with noisy samples and significantly outperforms previous baselines.

301CityNav: Language-Goal Aerial Navigation Dataset Using Geographic Information

[openreview] [pdf]

Abstract Vision-and-language navigation (VLN) aims to guide autonomous agents through real-world environments by integrating visual and linguistic cues. Despite notable advancements in ground-level navigation, the exploration of aerial navigation using these modalities remains limited. This gap primarily arises from a lack of suitable resources for real-world, city-scale aerial navigation studies. To remedy this gap, we introduce CityNav, a novel dataset explicitly designed for language-guided aerial navigation in photorealistic 3D environments of real cities. CityNav comprises 32k natural language descriptions paired with human demonstration trajectories, collected via a newly developed web-based 3D simulator. Each description identifies a navigation goal, utilizing the names and locations of landmarks within actual cities. As an initial step toward addressing this challenge, we provide baseline models of navigation agents that incorporate an internal 2D spatial map representing landmarks referenced in the descriptions. We have benchmarked the latest aerial navigation methods alongside our proposed baseline model on the CityNav dataset. The findings are revealing: (i) our aerial agent model, trained on human demonstration trajectories, outperforms those trained on shortest-path trajectories by a large margin; (ii) incorporating 2D spatial map information markedly and robustly enhances navigation performance at a city scale; (iii) despite the use of map information, our challenging CityNav dataset reveals a persistent performance gap between our baseline models and human performance. To foster further research in aerial VLN, we have made the dataset and code available at https://anonymous.4open.science/w/city-nav-77E3/.

302Discrete Inversion: A Controllable Latent Space for Multinomial Diffusion and Masked Generative Models

[openreview] [pdf]

Abstract Discrete diffusion models have achieved notable success in tasks like image generation and masked language modeling, yet they face limitations in controlled content editing. This paper introduces {\bf Discrete Inversion}, the first approach to enable precise inversion for discrete diffusion models, including multinomial diffusion and masked generative models. By recording noise sequences and masking patterns during the forward diffusion process, Discrete Inversion facilitates accurate reconstruction and controlled edits without the need for predefined masks or attention map manipulation. We demonstrate the effectiveness of our method across both image and text domains, evaluating it on models like VQ-Diffusion, Paella, and RoBERTa. Our results show that Discrete Inversion not only preserves high fidelity in the original data but also enables flexible and user-friendly editing in discrete spaces, significantly advancing the capabilities of discrete generative models.

303Fast Direct: Query-Efficient Online Black-box Guidance for Diffusion-model Target Generation

[openreview] [pdf]

Abstract Guided diffusion-model generation is a promising direction for customizing the generation process of a pre-trained diffusion model to address specific downstream tasks. Existing guided diffusion models either rely on training of the guidance model with pre-collected datasets or require the objective functions to be differentiable. However, for most real-world tasks, offline datasets are often unavailable, and the objective functions are often not differentiable, such as image generation with human preferences, molecular generation for drug discovery, and material design. Thus, we need an online algorithm capable of collecting data during runtime and supporting a black-box objective function. Moreover, the query efficiency of the algorithm is also critical because the objective evaluation of the query is often expensive in real-world scenarios. In this work, we propose a novel and simple algorithm, Fast Direct, for query-efficient online black-box target generation. Our Fast Direct builds a pseudo-target on the data manifold to update the noise sequence of the diffusion model with a universal direction, which is promising to perform query-efficient guided generation. Extensive experiments on twelve high-resolution (1024×1024) image target generation tasks and six 3D-molecule target generation tasks show 6× up to 10× query efficiency improvement and 11× up to 44× query efficiency improvement, respectively.

304Hierarchical Multiscale Diffuser for Extendable Long-Horizon Planning

[openreview] [pdf]

Abstract This paper introduces the Hierarchical Multiscale Diffuser (HM-Diffuser), a novel approach for efficient long-horizon planning. Building on recent advances in diffusion-based planning, our method addresses the challenge of planning over horizons significantly longer than those available in the training data. We decompose the problem into two key subproblems. The first phase, Progressive Trajectory Extension (PTE), involves stitching short trajectories together to create datasets with progressively longer trajectories. In the second phase, we train the HM-Diffuser on these extended datasets, preserving computational efficiency while enhancing long-horizon planning capabilities. The hierarchical structure of the HM-Diffuser allows for subgoal generation at multiple temporal resolutions, enabling a top-down planning approach that aligns high-level, long-term goals with low-level, short-term actions. Experimental results demonstrate that the combined PTE and HM-Diffuser approach effectively generates long-horizon plans, extending far beyond the originally provided trajectories.

305Multi-expert collaboration: Enhancing heterogeneous knowledge independence and alignment in knowledge distillation

[openreview] [pdf]

Abstract Heterogeneous multi-teacher knowledge distillation attempts to learn a versatile student neural network from multiple pre-trained heterogeneous teachers. However, current methods suffer from a lack of independence and alignment among the heterogeneous knowledge sources. To address this issue, we propose a novel method called Multi-Expert Collaboration (MEC). Our approach aggregates multiple expert classifiers within the student model, replacing the conventional single-head architecture. By ensuring that each expert's independent classifier operates without interfering with the others, we enhance the independence of the heterogeneous knowledge. Inspired by Helmholtz Free Energy (HFE) theory, we introduce an anchor-based HFE self-normalization strategy to align the heterogeneous knowledge effectively. This method ensures consistent energy levels across all classifiers, allowing the appropriate classifier to achieve the highest confidence for in-distribution data. Extensive experiments on CIFAR-100 and ImageNet-100 datasets demonstrate that MEC significantly outperforms existing heterogeneous multi-teacher knowledge distillation methods, achieving an average accuracy improvement of over 10%.

306Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction

[openreview] [pdf]

Abstract Generative modeling of discrete data underlies important applications spanning text-based agents like ChatGPT to the design of the very building blocks of life in protein sequences. However, application domains need to exert control over the generated data by steering the generative process—typically via RLHF—to satisfy a specified property, reward, or affinity metric. In this paper, we study the problem of steering Masked Diffusion Models (MDMs), a recent class of discrete diffusion models that offer a compelling alternative to traditional autoregressive models. We introduce Discrete Denoising Posterior Prediction (DDPP), a novel framework that casts the task of steering pretrained MDMs as a problem of probabilistic inference by learning to sample from a target Bayesian posterior. Our DDPP framework leads to a family of three novel objectives that are all simulation-free, and thus scalable while applying to general non-differentiable reward functions. Empirically, we instantiate DDPP by steering MDMs to perform class-conditional pixel-level image modeling, RLHF-based alignment of MDMs using text-based rewards, and finetuning protein language models to generate more diverse secondary structures and shorter proteins. We substantiate our designs via wet-lab validation, where we observe transient expression of reward-optimized protein sequences.

307Risk Informed Policy Learning for Safer Exploration

[openreview] [pdf]

Abstract Reinforcement learning algorithms typically necessitate extensive exploration of the state space to find optimal policies. However, in safety-critical applications, the risks associated with such exploration can lead to catastrophic consequences. Existing safe exploration methods mitigate this by imposing constraints, but these often result in overly conservative behaviours and inefficient learning. Overfitting on negative experiences hampers the agent’s ability to learn accurate risk representations, limiting its exploration of risky yet potentially high-reward regions of the state space. To address this, we introduce a method that explicitly learns state-conditioned risk representations by incorporating an inductive bias. By augmenting state features with these risk representations, our approach naturally encourages safer exploration without being excessively cautious, resulting in more efficient and safer policy learning. Empirical evaluations across diverse environments show that our method significantly improves task performance while reducing constraint violations during training, underscoring its effectiveness in balancing exploration with safety.

308OccProphet: Pushing the Efficiency Frontier of Camera-Only 4D Occupancy Forecasting with an Observer-Forecaster-Refiner Framework

[openreview] [pdf]

Abstract Predicting variations in complex traffic environments is crucial for the safety of autonomous driving. Recent advancements in occupancy forecasting have enabled forecasting future 3D occupied status in driving environments by observing historical 2D images. However, high computational demands make occupancy forecasting less efficient during training and inference stages, hindering its feasibility for deployment on edge agents. In this paper, we propose a novel framework, i.e., OccProphet, to efficiently and effectively learn occupancy forecasting with significantly lower computational requirements while maintaining forecasting accuracy. OccProphet comprises three lightweight components: Observer, Forecaster, and Refiner. The Observer extracts spatio-temporal features from 3D using the proposed Efficient 4D Aggregation with Tripling-Attention Fusion, while the Forecaster and Refiner conditionally predict and refine future occupancy inferences. Experimental results on nuScenes, Lyft-Level5, and nuScenes-Occupancy datasets demonstrate that OccProphet is both training- and inference-friendly. OccProphet reduces 58%–78% of the computational cost with a 2.6× speedup compared with the state-of-the-art Cam4DOcc. Moreover, it achieves 4%–18% relatively higher forecasting accuracy. The code will be publicly available.

309Enhancing Training Robustness through Influence Measure

[openreview] [pdf]

Abstract In the field of machine learning, the pursuit of robust and accurate models is ongoing. A key aspect of achieving robustness lies in identifying which data points in the training set should be excluded and which high-quality, potentially unlabeled data points outside the training set should be incorporated to improve the model’s performance on unseen data. To accomplish this, an effective metric is needed to evaluate the contribution of each data point toward enhancing overall model performance. This paper proposes the use of an influence measure as a metric to assess the impact of training data on test set performance. Additionally, we introduce a data selection method to optimize the training set as well as a dynamic active learning algorithm driven by the influence measure. The effectiveness of these methods is demonstrated through extensive simulations and real-world datasets.

310PHI-S: Distribution Balancing for Agglomerative Models

[openreview] [pdf]

Abstract Various visual foundation models have distinct strengths and weaknesses, both of which can be improved through heterogeneous multi-teacher knowledge distillation without labels, termed “agglomerative models.” We build upon this body of work by studying the effect of the teachers’ activation statistics, particularly the impact of the loss function on the resulting student model quality. We explore a standard toolkit of statistical normalization techniques to better align the different distributions and assess their effects. Further, we examine the impact on downstream teacher-matching metrics, which motivates the use of Hadamard matrices. With these matrices, we demonstrate useful properties, showing how they can be used for isotropic standardization, where each dimension of a multivariate distribution is standardized using the same scale. We call this technique “PHI Standardization” (PHI-S) and empirically demonstrate that it produces the best student model across the suite of methods studied.
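
A sketch of the isotropic standardization idea with a Hadamard rotation, assuming the feature dimension is a power of two (`scipy.linalg.hadamard` requires this): the orthonormal rotation spreads variance evenly across dimensions, after which one shared scale standardizes them all. Details of how PHI-S is applied to teacher activations are simplified here.

```python
import numpy as np
from scipy.linalg import hadamard

def phi_standardize(X):
    """Isotropic standardization sketch: rotate with a normalized
    Hadamard matrix, then divide every dimension by one shared scale."""
    d = X.shape[1]                   # assumes d is a power of two
    H = hadamard(d) / np.sqrt(d)     # orthonormal Hadamard rotation
    Xc = X - X.mean(axis=0)          # center the activations
    Xr = Xc @ H.T                    # rotation evens out per-dim variance
    return Xr / Xr.std()             # a single isotropic scale for all dims

X = np.random.default_rng(0).normal(size=(1000, 64)) * [1.0] * 32 + [0.0] * 32
print(phi_standardize(X).std(axis=0)[:4])  # per-dim stds are near-equal
```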

311Generative bandit optimization via diffusion posterior sampling

[openreview] [pdf]

Abstract Many real-world discovery problems, including drug and material design, can be modeled within the bandit optimization framework, where an agent selects a sequence of experiments to efficiently optimize an unknown reward function. However, classic bandit algorithms operate on fixed finite or continuous action sets, making discovering novel designs impossible in the former case, and often leading to the curse of dimensionality in the latter, thus rendering these methods impractical. In this work, we first formalize the generative bandit setting, where an agent wishes to maximize an unknown reward function over the support of a data distribution, often called the data manifold, which implicitly encodes complex constraints (e.g., the geometry of valid molecules), and from which (unlabeled) sample data is available (e.g., a dataset of valid molecules). We then propose Diffusion Posterior Sampling (DiffPS), an algorithm that tackles the exploration-exploitation problem directly on the learned data manifold by leveraging a conditional diffusion model. We formally show that the statistical complexity of DiffPS adapts to the intrinsic dimensionality of the data, overcoming the curse of dimensionality in high-dimensional settings. Our experimental evaluation supports the theoretical claims and demonstrates promising performance in practice.

312CoDiCast: Conditional Diffusion Model for Weather Prediction with Uncertainty Quantification

[openreview] [pdf]

Abstract Accurate weather forecasting is critical for science and society. Yet, existing methods have not demonstrated high accuracy, low uncertainty, and high computational efficiency simultaneously. On one hand, to quantify the uncertainty in weather predictions, the strategy of ensemble forecast (i.e., generating a set of diverse predictions) is often employed. However, traditional ensemble numerical weather prediction (NWP) is computationally intensive. On the other hand, even though most existing machine learning-based weather prediction (MLWP) approaches are efficient and accurate, they are deterministic and cannot capture the uncertainty of weather forecasting. To tackle these challenges, we propose CoDiCast, a conditional diffusion model to generate accurate global weather prediction, while achieving uncertainty quantification and modest computational cost. The key idea behind the prediction task is to generate realistic weather scenarios at a future time point, conditioned on observations from the recent past. Due to the probabilistic nature of diffusion models, they can be properly applied to capture the uncertainty of weather predictions. Therefore, we accomplish uncertainty quantification by repeatedly sampling from stochastic Gaussian noise for each initial weather state and running the denoising process multiple times. Experimental results demonstrate that CoDiCast outperforms several existing MLWP methods in accuracy, and is faster than NWP models in inference speed. CoDiCast can generate 3-day global weather forecasts, at 6-hour steps and 5.625° latitude-longitude resolutions, for over 5 variables, in about 12 minutes on a commodity A100 GPU machine with 80GB memory. The anonymous code is provided at https://anonymous.4open.science/r/CoDiCast/.
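
The ensemble mechanism described above reduces to repeated reverse diffusion from independent noise draws. In this sketch, `sample_fn` stands for one full conditional denoising run and is an assumed interface, as is the choice that forecasts share the condition's shape.

```python
import numpy as np

def ensemble_forecast(sample_fn, condition, n_members=20, seed=0):
    """Ensemble forecast via repeated sampling from a conditional diffusion
    model; the spread across members quantifies forecast uncertainty."""
    rng = np.random.default_rng(seed)
    members = np.stack([
        # Each member starts the reverse process from fresh Gaussian noise.
        sample_fn(condition, rng.standard_normal(condition.shape))
        for _ in range(n_members)
    ])
    return members.mean(axis=0), members.std(axis=0)  # point forecast, spread
```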

313Boosting Offline Multi-Objective Reinforcement Learning via Preference Conditioned Diffusion Models

[openreview] [pdf]

Abstract Multi-objective reinforcement learning (MORL) addresses sequential decision-making problems with multiple objectives by learning policies optimized for diverse preferences. While traditional methods necessitate costly online interaction with the environment, recent approaches leverage static datasets containing pre-collected trajectories, making offline MORL the preferred choice for real-world applications. However, existing offline MORL techniques suffer from limited expressiveness and poor generalization on out-of-distribution (OOD) preferences. To overcome these limitations, we propose Diffusion-based Multi-Objective Reinforcement Learning (DiffMORL), a generalizable diffusion-based planning framework for MORL. Leveraging the strong expressiveness and generation capability of diffusion models, DiffMORL further boosts its generalization through offline data mixup, which mitigates the memorization phenomenon and facilitates feature learning by data augmentation. By training on the augmented data, DiffMORL is able to condition on a given preference, whether in-distribution or OOD, to plan the desired trajectory and extract the corresponding action. Experiments conducted on the D4MORL benchmark demonstrate that DiffMORL achieves state-of-the-art results across nearly all tasks. Notably, it surpasses the best baseline on most tasks, underscoring its remarkable generalization ability in offline MORL scenarios.

314A Trajectory Probability Network for City-Scale Road Volume Prediction

[openreview] [pdf]

Abstract City-scale road volume prediction is a fundamental task in traffic management. However, the observation data are often incomplete and biased, posing a challenge for accurate prediction. Existing methods address this issue through interpolation techniques or manual priors, but they typically provide only a deterministic restoration, overlooking the influence of other potential scenarios. To overcome these limitations, we propose a novel neural network-based probabilistic model, the Trajectory Probability Network (TraPNet), which predicts traffic volume through the aggregation of the joint distribution of potential trajectories. TraPNet makes full use of current observations, historical data, and road network information to offer a comprehensive inference of road volumes. Unlike autoregressive methods, TraPNet makes predictions in a single step, substantially reducing computational time while maintaining high predictive accuracy. Experiments on real-world road networks demonstrate that TraPNet outperforms state-of-the-art methods, and can keep this advantage with only a 20% observation ratio. The code will be made publicly available.

315Disentangling data distribution for Federated Learning

[openreview] [pdf]

Abstract Federated Learning (FL) facilitates collaborative training of a global model whose performance is boosted by private data owned by distributed clients, without compromising data privacy. Yet the wide applicability of FL is hindered by entanglement of data distributions across different clients. This paper demonstrates for the first time that by disentangling data distributions FL can in principle achieve efficiencies comparable to those of distributed systems, requiring only one round of communication. To this end, we propose a novel FedDistr algorithm, which employs stable diffusion models to decouple and recover data distributions. Empirical results on the CIFAR100 and DomainNet datasets show that FedDistr significantly enhances model utility and efficiency in both disentangled and near-disentangled scenarios while ensuring privacy, outperforming traditional federated learning methods.

316Cross-Domain Offline Policy Adaptation with Optimal Transport and Dataset Constraint

[openreview] [pdf]

Abstract Offline reinforcement learning (RL) often struggles with limited data. This work explores cross-domain offline RL where offline datasets (with possibly sufficient data) from another domain can be accessed to facilitate policy learning. However, the underlying environments of the two datasets may have dynamics mismatches, incurring inferior performance when simply merging the data of two domains. Existing methods mitigate this issue by training domain classifiers, using contrastive learning methods, etc. Nevertheless, they still rely on a large amount of target domain data to function well. Instead, we address this problem by establishing a concrete performance bound of a policy given datasets from two domains. Motivated by the theoretical insights, we propose to align transitions in the two datasets using optimal transport and selectively share source domain samples, without training any neural networks. This enables reliable data filtering even given a few target domain data. Additionally, we introduce a dataset regularization term that ensures the learned policy remains within the scope of the target domain dataset, preventing it from being biased towards the source domain data. Consequently, we propose the Optimal Transport Data Filtering (dubbed OTDF) method and examine its effectiveness by conducting extensive experiments across various dynamics shift conditions (e.g., gravity shift, morphology shift), given limited target domain data. It turns out that OTDF exhibits superior performance on many tasks and dataset qualities, often surpassing prior strong baselines by a large margin.
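
A sketch of OT-based filtering on toy 2-D "transitions" using the POT library: compute the optimal coupling between source and target samples, score each source sample by its transport cost under the plan, and keep the cheapest fraction. The scoring rule and `keep_frac` are assumptions; OTDF's actual selection criterion and transition representation may differ.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def ot_filter_source(source, target, keep_frac=0.5):
    """Keep the source samples that transport most cheaply onto the target."""
    M = ot.dist(source, target)                  # pairwise squared distances
    a = np.full(len(source), 1.0 / len(source))  # uniform source marginal
    b = np.full(len(target), 1.0 / len(target))  # uniform target marginal
    G = ot.emd(a, b, M)                          # optimal transport plan
    # Per-source-sample expected transport cost under the optimal plan.
    cost = (G * M).sum(axis=1) * len(source)
    return np.argsort(cost)[: int(keep_frac * len(source))]

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(200, 2))   # source-domain samples
tgt = rng.normal(0.5, 1.0, size=(50, 2))    # few target-domain samples
print(ot_filter_source(src, tgt, keep_frac=0.25).shape)
```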

317Federated Adapter on Foundation Models: An Out-Of-Distribution Approach

[openreview] [pdf]

Abstract As foundation models gain increasing attention from both academic and industrial communities, Federated Foundation Models (FedFM) have emerged as a privacy-preserving approach for collaboratively fine-tuning models in federated learning (FL) frameworks using distributed datasets across multiple clients. A key challenge for FedFM, given the versatile nature of foundation models, is addressing out-of-distribution (OOD) generalization, where unseen tasks or clients may exhibit distribution shifts leading to suboptimal performance. Although numerous studies have explored OOD generalization in conventional FL, these methods are inadequate for FedFM due to the challenges posed by large parameter scales and increased data heterogeneity: large parameter scales result in high computational and communication costs, while increased data heterogeneity (e.g., cross-domain data) leads to suboptimal performance of the aggregated global model on individual client distributions. To bridge this gap, we propose a new method, called FedOA, to enhance the OOD generalization of FedFM under these conditions. Specifically, our method employs adapter-based parameter-efficient fine-tuning for efficient learning, and introduces an additional personalized model with feature distance-based regularization to ensure distribution alignment and provide OOD generalization guarantees for each client. Theoretically, we demonstrate that the conventional aggregated global model in FedFM inherently retains OOD generalization capabilities, and our proposed method enhances the personalized model's OOD generalization through regularization informed by the global model, with proven convergence under general non-convex settings. Empirically, the effectiveness of the proposed method is validated on benchmark datasets across various NLP tasks.

318Fine-Tuning of Continuous-Time Diffusion Models as Entropy-Regularized Control

[openreview] [pdf]

Abstract Diffusion models excel at capturing complex data distributions, such as those of natural images and proteins. While diffusion models are trained to represent the distribution in the training dataset, we often are more concerned with other properties, such as the aesthetic quality of the generated images or the functional properties of generated proteins. Diffusion models can be finetuned in a goal-directed way by maximizing the value of some reward function (e.g., the aesthetic quality of an image). However, these approaches may lead to reduced sample diversity, significant deviations from the training data distribution, and even poor sample quality due to the exploitation of an imperfect reward function. The last issue often occurs when the reward function is a learned model meant to approximate a ground-truth “genuine” reward, as is the case in many practical applications. These challenges, collectively termed “reward collapse,” pose a substantial obstacle. To address this reward collapse, we frame the finetuning problem as entropy-regularized control against the pretrained diffusion model, i.e., directly optimizing entropy-enhanced rewards with neural SDEs. We present theoretical and empirical evidence that demonstrates our framework is capable of efficiently generating diverse samples with high genuine rewards, mitigating the overoptimization of imperfect reward models.

319Understanding Impact of Human Feedback via Influence Functions

[openreview] [pdf]

Abstract In Reinforcement Learning from Human Feedback (RLHF), it is crucial to learn suitable reward models from human feedback to align large language models (LLMs) with human intentions. However, human feedback can often be noisy, inconsistent, or biased, especially when evaluating complex responses. Such feedback can lead to misaligned reward signals, potentially causing unintended side effects during the RLHF process. To address these challenges, we explore the use of influence functions to measure the impact of human feedback on the performance of reward models. We propose a compute-efficient approximation method that enables the application of influence functions to LLM-based reward models and large-scale preference datasets. In our experiments, we demonstrate two key applications of influence functions: (1) detecting common forms of labeler bias in human feedback datasets and (2) guiding labelers to refine their strategies to align more closely with expert feedback. By quantifying the impact of human feedback on reward models, we believe that influence functions can enhance feedback interpretability and contribute to scalable oversight in RLHF, helping labelers provide more accurate and consistent feedback.
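
At its simplest, an influence estimate asks how a training example's gradient aligns with the gradient of the validation loss. The sketch below uses the identity-Hessian approximation, which is far cruder than the compute-efficient estimator the paper proposes; it only illustrates the kind of score being computed per feedback example.

```python
import torch

def influence_scores(train_grads, val_grad):
    """Approximate influence of each training example on validation loss.

    Exact influence is -grad_val^T H^{-1} grad_z; taking H ~ I reduces it
    to a gradient inner product (a common Hessian-free simplification).
    """
    return torch.stack([-(g * val_grad).sum() for g in train_grads])

# Example usage (assumed per-example gradients): flag the feedback whose
# influence on the validation loss is most harmful.
# harmful = influence_scores(grads, val_grad).topk(10).indices
```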

320Diffusion Models Are Real-Time Game Engines

[openreview] [pdf]

Abstract We present GameNGen, the first game engine powered entirely by a neural model that also enables real-time interaction with a complex environment over long trajectories at high quality. When trained on the classic game DOOM, GameNGen extracts gameplay and uses it to generate a playable environment that can interactively simulate new trajectories. GameNGen runs at 20 frames per second on a single TPU and remains stable over extended multi-minute play sessions. Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation, even after 5 minutes of auto-regressive generation. GameNGen is trained in two phases: (1) an RL-agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions. Conditioning augmentations help ensure stable auto-regressive generation over long trajectories, and decoder fine-tuning improves the fidelity of visual details and text.

321On the Convergence of FedProx with Extrapolation and Inexact Prox

[openreview] [pdf]

Abstract Enhancing the FedProx federated learning algorithm (Li et al., 2020) with server-side extrapolation, Li et al. (2024a) recently introduced the FedExProx method. Their theoretical analysis, however, relies on the assumption that each client computes a certain proximal operator exactly, which is impractical since this is virtually never possible to do in real settings. In this paper, we investigate the behavior of FedExProx without this exactness assumption in the smooth and globally strongly convex setting. We establish a general convergence result, showing that inexactness leads to convergence to a neighborhood of the solution. Additionally, we demonstrate that, with careful control, the adverse effects of this inexactness can be mitigated. By linking inexactness to biased compression (Beznosikov et al., 2023), we refine our analysis, highlighting the robustness of extrapolation to inexact proximal updates. We also examine the local iteration complexity required by each client to achieve the required level of inexactness using various local optimizers. Our theoretical insights are validated through comprehensive numerical experiments.
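
A schematic of the setting, with step sizes, iteration counts, and the gradient-descent inner solver as assumptions: each client only approximates its proximal point with a few local steps, and the server extrapolates through the average of those approximate points.

```python
import numpy as np

def inexact_prox(grad_f, x_t, gamma, steps=10, lr=0.1):
    """Approximate prox_{gamma f}(x_t) with a few gradient steps.
    Fewer steps means a more inexact proximal point."""
    z = x_t.copy()
    for _ in range(steps):
        # Gradient of f(z) + (1 / (2*gamma)) * ||z - x_t||^2.
        z -= lr * (grad_f(z) + (z - x_t) / gamma)
    return z

def fedexprox_style_round(client_grads, x_t, gamma, alpha):
    """One round: average the clients' inexact proximal points, then
    extrapolate from the server iterate (alpha > 1 gives extrapolation)."""
    prox_points = [inexact_prox(g, x_t, gamma) for g in client_grads]
    avg = np.mean(prox_points, axis=0)
    return x_t + alpha * (avg - x_t)
```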

322Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning

[openreview] [pdf]

Abstract In offline reinforcement learning, it is necessary to manage out-of-distribution actions to prevent overestimation of value functions. One class of methods, policy-regularized methods, addresses this problem by constraining the target policy to stay close to the behavior policy. Although several approaches suggest representing the behavior policy as an expressive diffusion model to boost performance, it remains unclear how to regularize the target policy given a diffusion-modeled behavior sampler. In this paper, we propose Diffusion Actor-Critic (DAC), which formulates the Kullback-Leibler (KL) constrained policy iteration as a diffusion noise regression problem, enabling direct representation of target policies as diffusion models. Our approach follows the actor-critic learning paradigm, in which we alternately train a diffusion-modeled target policy and a critic network. The actor training loss includes a soft Q-guidance term from the Q-gradient. The soft Q-guidance is grounded in the theoretical solution of the KL-constrained policy iteration, which prevents the learned policy from taking out-of-distribution actions. We demonstrate that such diffusion-based policy constraint, along with the coupling of the lower confidence bound of the Q-ensemble as value targets, not only preserves the multi-modality of target policies but also contributes to stable convergence and strong performance in DAC. Our approach is evaluated on the D4RL benchmarks and outperforms the state-of-the-art in nearly all environments.

323Denoising Diffusion Causal Discovery

[openreview] [pdf]

Abstract A common theme across multiple disciplines of science is to understand the underlying dependencies between variables from observational data. Such dependencies are often modeled as Bayesian Network (BNs), which by definition are Directed Acyclic Graphs (DAGs). Recent advancements, such as NOTEARS and DAG-GNN, have focused on formulating continuous DAG constraints and learning DAGs via continuous optimization. However, these methods often have scalability issues and face challenges when applied to real world data. In this paper, we propose Denoising Diffusion Causal Discovery (DDCD), a new learning framework that leverages Denoising Diffusion Probabilistic Models (DDPMs) for causal structural learning. Using the denoising objective, our method allows the model to explore a wider range of noise in the data and effectively captures both linear and nonlinear dependencies. It also has reduced complexity and is more suitable for inference of larger networks. To accommodate potential feedback loops in biological networks, we propose a k-hop DAG constraint. Additionally, we suggest using fixed-size bootstrap sampling to ensure similar training performance across varying dataset sizes. Our experiments on synthetic data demonstrate that DDCD achieves consistent competitive performance compared to existing methods while noticeably reducing computation time. We also show that DDCD can generate trustworthy networks from real-world datasets.
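
One natural reading of a k-hop DAG constraint, shown as a sketch: truncate the NOTEARS penalty tr(exp(W∘W)) − d, whose i-th series term counts weighted cycles of length i, at order k, so only short cycles are penalized and longer feedback loops are tolerated. The exact form used by DDCD is an assumption here.

```python
import torch

def k_hop_dag_penalty(W, k):
    """Acyclicity penalty restricted to cycles of length <= k (sketch).

    Standard NOTEARS uses tr(exp(W*W)) - d, penalizing cycles of every
    length; truncating the exponential series at k relaxes the constraint
    for cycles longer than k hops.
    """
    d = W.shape[0]
    A = W * W                    # non-negative edge strengths
    P = torch.eye(d, dtype=A.dtype)
    penalty = torch.zeros((), dtype=A.dtype)
    fact = 1.0
    for i in range(1, k + 1):
        P = P @ A                # P = A^i after this step
        fact *= i
        penalty = penalty + torch.trace(P) / fact  # tr(A^i)/i! counts i-cycles
    return penalty
```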

324MGD3: Mode-Guided Dataset Distillation using Diffusion Models

[openreview] [pdf]

Abstract Dataset distillation aims to synthesize a smaller training set from a large dataset such that a model trained on this distilled set performs comparably to one trained on the entire dataset. For image classification, earlier methods proposed optimization strategies in the input space to synthesize a distilled dataset, but they are computationally expensive and difficult to scale to higher resolutions. Also, the datasets synthesized by these methods lack intra-class diversity as they ignore the modes of the data distribution. Recent works propose using generative models, among which diffusion models have shown promising results as they are known to capture the data distribution effectively. However, diffusion models tend to over-sample from the prominent modes of the data distribution, resulting in limited diversity in the generated samples. To address these limitations, we propose a mode-guided diffusion model in this work. Unlike existing works that fine-tune diffusion models for dataset distillation, we propose to use a pre-trained model without the need for fine-tuning. Our novel approach consists of three stages: Mode Discovery, Mode Guidance, and Stop Guidance. In the first stage, we discover distinct modes in the data distribution of a class to build a representative set. In the second stage, we use a pre-trained diffusion model and guide the diffusion process toward the discovered modes to generate distinct samples, ensuring intra-class diversity. However, mode-guided sampling can introduce artifacts in the synthetic samples, which affect performance. To control the fidelity of the synthetic dataset, we introduce Stop Guidance. We evaluate our method on multiple benchmark datasets, including ImageNette, ImageIDC, ImageNet-100, and ImageNet-1K; our method improves over the current state-of-the-art by 4.4%, 2.9%, 1.6%, and 1.6% on the respective datasets. In addition, our method does not require retraining of the diffusion model, which reduces computational requirements. We also demonstrate that our approach is effective with general-purpose diffusion models such as text-to-image Stable Diffusion, eliminating the need for a model pre-trained on the target dataset.

325Direct Judgement Preference Optimization

[openreview] [pdf]

Abstract Auto-evaluation is crucial for assessing response quality and offering feedback for model development. Recent studies have explored training large language models (LLMs) as generative judges to both evaluate model responses and generate natural language critiques. However, existing models have been trained almost exclusively with supervised fine-tuning (SFT), often only on a small number of datasets, resulting in poor generalization across different evaluation settings and tasks. In this paper, we investigate how learning from both positive and negative data with direct preference optimization (DPO) enhances the evaluation capabilities of LLM judges across three evaluation tasks: pairwise comparison, single rating, and binary classification. We achieve this by creating three forms of DPO data from a diverse collection of human and synthetic judgements on contemporary model outputs, with the goal of training our model to generate meaningful critiques, make accurate judgements, and understand what constitutes good and bad responses for a given user input. To demonstrate the effectiveness of our method, we train judge models at three sizes (8B, 12B, and 70B parameters) and conduct a comprehensive study over 13 benchmarks (7 pairwise, 4 single rating, and 2 classification), measuring agreement with human and GPT-4 annotations. Our models exhibit the best aggregate performance, with even our 8B model outperforming strong baselines like GPT-4o and specialized judge models, such as OffsetBias-8B, Auto-J-13B, Prometheus-2-8x7B, and Skywork-Critic-70B, in pairwise benchmarks. Further analysis shows that our judge model robustly counters biases such as position and length bias, flexibly adapts to practitioner-specified evaluation protocols, and provides helpful language feedback for improving downstream generator models.
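
For context, the generic pairwise DPO objective that such judge training builds on can be written in a few lines; the sketch below uses dummy sequence log-probabilities rather than anything from the paper's data pipeline.

```python
# Generic DPO loss: prefer the chosen judgement over the rejected one,
# regularized toward a frozen reference model. Dummy log-probs stand in
# for sums of token log-likelihoods over each response.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit reward: how much more the policy prefers each response
    # than the reference model does.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy batch of summed sequence log-probabilities.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-13.5, -9.8]))
print(loss.item())
```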

326FedSUV: Validity and Utility-guided Client Selection for Federated Learning

[openreview] [pdf]

Abstract Federated Learning faces significant challenges arising from two critical uncertainties: the validity of a client’s participation, which can be compromised by network and system heterogeneity, and the utility of the data contributed by each client, which varies due to heterogeneous statistical data. Traditional client selection methods often treat these uncertainties as a whole, leading to suboptimal performance. To address this issue, we propose FedSUV, an innovative client selection framework that decouples validity and utility uncertainties. FedSUV approaches client selection from a multi-objective optimization perspective, employing advanced bandit algorithms: a confidence bound-based linear contextual bandit for assessing validity and a Gaussian Process bandit for evaluating utility. We validate the effectiveness of FedSUV through both theoretical analysis and large-scale experiments conducted within our physical cluster.

327Fast constrained sampling in pre-trained diffusion models

[openreview] [pdf]

Abstract Diffusion models have dominated the field of large, generative image models, with the prime examples of Stable Diffusion and DALL-E 3 being widely adopted. These models have been trained to perform text-conditioned generation on vast numbers of image-caption pairs and as a byproduct, have acquired general knowledge about natural image statistics. However, when confronted with the task of constrained sampling, e.g. generating the right half of an image conditioned on the known left half, applying these models is a delicate and slow process, with previously proposed algorithms relying on expensive iterative operations that are usually orders of magnitude slower than text-based inference. This is counter-intuitive, as image-conditioned generation should rely less on the difficult-to-learn semantic knowledge that links captions and imagery, and should instead be achievable by lower-level correlations among image pixels. In practice, inverse models are trained or tuned separately for each inverse problem, e.g. by providing parts of images during training as an additional condition, to allow their application in realistic settings. However, we argue that this is not necessary and propose an algorithm for fast constrained sampling in large pre-trained diffusion models (Stable Diffusion) that requires no expensive backpropagation operations through the model and produces results comparable even to the state-of-the-art tuned models. Our method is based on a novel optimization perspective on sampling under constraints and employs a numerical approximation to the expensive gradients, previously computed using backpropagation, incurring significant speed-ups.
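
The abstract does not spell out the numerical gradient approximation, so the sketch below uses one standard derivative-free estimator (SPSA-style random perturbations) to enforce a known-left-half constraint without backpropagation; treat it as a stand-in for the paper's scheme, not its actual algorithm.

```python
# Derivative-free guidance sketch: approximate the gradient of a constraint
# loss w.r.t. the sample with SPSA (random-perturbation finite differences),
# avoiding backpropagation entirely. Stand-in estimator, toy 1-D "image".
import numpy as np

rng = np.random.default_rng(0)
y = np.ones(16)            # known left half (the constraint)
mask = np.arange(32) < 16  # which coordinates are constrained

def constraint_loss(x):
    return 0.5 * np.sum((x[mask] - y) ** 2)

def spsa_grad(f, x, eps=1e-3, n_dirs=8):
    """Average of simultaneous-perturbation two-point gradient estimates."""
    g = np.zeros_like(x)
    for _ in range(n_dirs):
        d = rng.choice([-1.0, 1.0], size=x.shape)  # Rademacher direction
        g += (f(x + eps * d) - f(x - eps * d)) / (2 * eps) * d
    return g / n_dirs

x = rng.normal(size=32)
for _ in range(200):  # gradient steps in place of a full guided sampling loop
    x -= 0.05 * spsa_grad(constraint_loss, x)
print("constraint residual:", np.abs(x[mask] - y).max())
```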

328Can We Ignore Labels in Out of Distribution Detection?

[openreview] [pdf]

Abstract Out-of-distribution (OOD) detection methods have recently become more prominent, serving as a core element in safety-critical autonomous systems. One major purpose of OOD detection is to reject invalid inputs that could lead to unpredictable errors and compromise safety. Due to the cost of labeled data, recent works have investigated the feasibility of self-supervised learning (SSL) OOD detection, unlabeled OOD detection, and zero-shot OOD detection. In this work, we identify a set of conditions for a theoretical guarantee of failure in unlabeled OOD detection algorithms from an information-theoretic perspective. These conditions are present in all OOD tasks dealing with real-world data: I) we provide theoretical proof of unlabeled OOD detection failure when there exists zero mutual information between the learning objective and the in-distribution labels, a.k.a. ‘label blindness’, II) we define a new OOD task – Adjacent OOD detection – that tests for label blindness and accounts for a previously ignored safety gap in all OOD detection benchmarks, and III) we perform experiments demonstrating that existing unlabeled OOD methods fail under conditions suggested by our label blindness theory and analyze the implications for future research in unlabeled OOD methods.

329Rectified Diffusion Guidance for Conditional Generation

[openreview] [pdf]

Abstract Classifier-Free Guidance (CFG), which combines the conditional and unconditional score functions with two coefficients summing to one, serves as a practical technique for diffusion model sampling. Theoretically, however, denoising with CFG cannot be expressed as a reciprocal diffusion process, which may consequently leave some hidden risks during use. In this work, we revisit the theory behind CFG and rigorously confirm that the improper configuration of the combination coefficients (i.e., the widely used summing-to-one version) brings about an expectation shift of the generative distribution. To rectify this issue, we propose ReCFG with a relaxation on the guidance coefficients such that denoising with ReCFG strictly aligns with the diffusion theory. We further show that our approach enjoys a closed-form solution given the guidance strength. That way, the rectified coefficients can be readily pre-computed via traversing the observed data, leaving the sampling speed barely affected. Empirical evidence on real-world data demonstrates the compatibility of our post-hoc design with existing state-of-the-art diffusion models, including both class-conditioned ones (e.g., EDM2 on ImageNet) and text-conditioned ones (e.g., SD3 on CC12M), without any retraining. We will open-source the code to facilitate further research.
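
A minimal illustration of the relaxation: standard CFG ties the two coefficients to sum to one, while a ReCFG-style combination leaves both free. The numeric coefficients below are placeholders; the paper derives the rectified values in closed form from observed data.

```python
# Classifier-free guidance combines conditional and unconditional scores.
# Standard CFG constrains the coefficients to sum to one; the relaxed form
# does not. Placeholder numbers only, not the paper's derived coefficients.
import numpy as np

def cfg(eps_cond, eps_uncond, w):
    # Standard CFG: coefficients (1 + w) and -w sum to one.
    return (1 + w) * eps_cond - w * eps_uncond

def recfg(eps_cond, eps_uncond, gamma_c, gamma_u):
    # Relaxed combination: gamma_c + gamma_u need not equal one.
    return gamma_c * eps_cond + gamma_u * eps_uncond

eps_c, eps_u = np.array([0.8, -0.2]), np.array([0.5, 0.1])
print(cfg(eps_c, eps_u, w=3.0))        # summing-to-one form
print(recfg(eps_c, eps_u, 3.7, -2.9))  # placeholder rectified coefficients
```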

330Quality Diversity Imitation Learning

[openreview] [pdf]

Abstract Imitation learning (IL) has shown great potential in various applications, such as robot control. However, traditional IL methods are usually designed to learn only one specific type of behavior since demonstrations typically correspond to a single expert. In this work, we introduce the first generic framework for Quality Diversity Imitation Learning (QD-IL), which enables the agent to learn a broad range of skills from limited demonstrations. Our framework integrates the principles of quality diversity with adversarial imitation learning (AIL) methods, and can potentially improve any inverse reinforcement learning (IRL) method. Empirically, our framework significantly improves the QD performance of GAIL and VAIL on challenging continuous control tasks derived from MuJoCo environments. Moreover, our method even achieves 2x expert performance in the most challenging Humanoid environment.

331C2INet: Realizing Incremental Trajectory Prediction with Prior-Aware Continual Causal Intervention

[openreview] [pdf]

Abstract Trajectory prediction for multi-agents in complex scenarios is crucial for applications like autonomous driving. However, existing methods often overlook environmental biases, which leads to poor generalization. Additionally, hardware constraints limit the use of large-scale data across environments, and continual learning settings exacerbate the challenge of catastrophic forgetting. To address these issues, we propose the Continual Causal Intervention (C2INet) method for generalizable multi-agent trajectory prediction within a continual learning framework. Using variational inference, we align the environment-related prior with the posterior estimator of confounding factors in the latent space, thereby intervening in the causal correlations that affect trajectory representation. Furthermore, we store optimal variational priors across various scenarios using a memory queue, ensuring continuous debiasing during incremental task training. The proposed C2INet enhances adaptability to diverse tasks while preserving previous task information to prevent catastrophic forgetting. It also incorporates pruning strategies to mitigate overfitting. Comparative evaluations on three real and synthetic complex datasets against state-of-the-art methods demonstrate that our proposed method consistently achieves reliable prediction performance, effectively mitigating confounding factors unique to different scenarios. This highlights the practical value of our method for real-world applications.

332Neural Approximate Mirror Maps for Constrained Diffusion Models

[openreview] [pdf]

Abstract Diffusion models excel at creating visually convincing images, but they often struggle to meet subtle constraints inherent in the training data. Such constraints could be physics-based (e.g., satisfying a PDE), geometric (e.g., respecting symmetry), or semantic (e.g., including a particular number of objects). When the training data all satisfy a certain constraint, enforcing this constraint on a diffusion model makes it more reliable for generating valid synthetic data and solving constrained inverse problems. However, existing methods for constrained diffusion models are restricted in the constraints they can handle. For instance, recent work proposed to learn mirror diffusion models (MDMs), but analytical mirror maps only exist for convex constraints and can be challenging to derive. We propose neural approximate mirror maps (NAMMs) for general, possibly non-convex constraints. Our approach only requires a differentiable distance function from the constraint set. We learn an approximate mirror map that transforms data into an unconstrained space and a corresponding approximate inverse that maps data back to the constraint set. A generative model, such as an MDM, can then be trained in the learned mirror space and its samples restored to the constraint set by the inverse map. We validate our approach on a variety of constraints, showing that compared to an unconstrained diffusion model, a NAMM-based MDM substantially improves constraint satisfaction. We also demonstrate how existing diffusion-based inverse-problem solvers can be easily applied in the learned mirror space to solve constrained inverse problems.

333Uncertainty-Regularized Diffusional Subgoals for Hierarchical Reinforcement Learning

[openreview] [pdf]

Abstract Hierarchical reinforcement learning (HRL) aims to solve complex tasks by making decisions across multiple levels of temporal abstraction. However, off-policy training of hierarchical policies faces non-stationarity issues because the low-level policy is constantly changing, which makes it difficult for the high-level policy that generates subgoals to adapt. In this paper, we propose a conditional diffusion model-based approach for subgoal generation to mitigate these non-stationarity challenges. Specifically, we employ a Gaussian Process (GP) prior on subgoal generation as a surrogate distribution to regularize the diffusion policy and inform the diffusion process about uncertain areas in the action space. We introduce adaptive inducing states to facilitate sparse GP-based subgoal generation, enhancing sample efficiency and promoting better exploration in critical regions of the state space. Building on this framework, we develop an exploration strategy that identifies promising subgoals based on the learned predictive distribution of the diffusional subgoals. Experimental results demonstrate significant improvements in both sample efficiency and performance on challenging continuous control benchmarks compared to prior HRL methods.

334Scaling Optimal LR Across Token Horizons

[openreview] [pdf]

Abstract State-of-the-art LLMs are powered by scaling -- scaling model size, dataset size, and cluster size. It is economically infeasible to extensively tune hyperparameters for the largest runs. Instead, approximately optimal hyperparameters must be inferred or transferred from smaller experiments. Hyperparameter transfer across model sizes has been studied in Yang et al. However, hyperparameter transfer across dataset size -- or token horizon -- has not been studied yet. To remedy this we conduct a large-scale empirical study on how the optimal learning rate (LR) depends on the token horizon in LLM training. We first demonstrate that the optimal LR changes significantly with token horizon -- longer training necessitates a smaller LR. Secondly, we demonstrate that the optimal LR follows a scaling law and that the optimal LR for longer horizons can be accurately estimated from shorter horizons via such scaling laws. We also provide a rule-of-thumb for transferring LR across token horizons with zero overhead over current practices. Lastly, we provide evidence that LLama-1 used too high an LR, and argue that hyperparameter transfer across data size is an overlooked component of LLM training.
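
A rule of this kind can be applied by fitting a power law to tuned LRs from short-horizon runs and extrapolating in log-log space. The sketch below does exactly that on made-up numbers; neither the horizons nor the LRs are measurements from the paper.

```python
# Fit a power-law rule lr*(T) = a * T^b (with b < 0) to optimal LRs measured
# at short token horizons, then extrapolate to a longer horizon.
import numpy as np

horizons = np.array([1e9, 2e9, 4e9, 8e9])            # training tokens
best_lr = np.array([3e-3, 2.2e-3, 1.6e-3, 1.2e-3])   # tuned LR at each horizon

# Linear fit in log-log space: log lr = b * log T + log a.
b, log_a = np.polyfit(np.log(horizons), np.log(best_lr), 1)
a = np.exp(log_a)
print(f"lr*(T) ~= {a:.3g} * T^({b:.3f})")

target = 1e11  # extrapolate to a 100B-token run
print(f"predicted optimal LR at {target:.0e} tokens: {a * target ** b:.2e}")
```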

335Can foundation models actively gather information in interactive environments to test hypotheses?

[openreview] [pdf]

Abstract While problem solving is a standard evaluation task for foundation models, a crucial component of problem solving---actively and strategically gathering information to test hypotheses---has not been closely investigated. To assess the information gathering abilities of foundation models in interactive environments, we introduce a framework in which a model must determine the factors influencing a hidden reward function by iteratively reasoning about its previously gathered information and proposing its next exploratory action to maximize information gain at each step. We implement this framework in both a text-based environment, which offers a tightly controlled setting and enables high-throughput parameter sweeps, and in an embodied 3D environment, which requires addressing complexities of multi-modal interaction more relevant to real-world applications. We further investigate whether approaches such as self-correction and increased inference time improve information gathering efficiency. In a relatively simple task that requires identifying a single rewarding feature, we find that Gemini’s information gathering capability is close to optimal. However, when the model must identify a conjunction of rewarding features, performance is suboptimal. The drop in performance is due partly to the model translating the task description into a policy and partly to the model’s effectiveness in using its in-context memory. Performance is comparable in both text and 3D embodied environments, although imperfect visual object recognition reduces accuracy in drawing conclusions from gathered information in the 3D embodied case. For single-feature-based rewards, we find that smaller models curiously perform better; for conjunction-based rewards, incorporating self-correction into the model improves performance.

336Bootstrapped Model Predictive Control

[openreview] [pdf]

Abstract Model Predictive Control (MPC) has been demonstrated to be effective in continuous control tasks. When a world model and a value function are available, planning a sequence of actions ahead of time leads to a better policy. Existing methods typically obtain the value function and the corresponding policy in a model-free manner. However, we find that such an approach struggles with complex tasks, resulting in poor policy learning and inaccurate value estimation. To address this problem, we leverage the strengths of MPC itself. In this work, we introduce Bootstrapped Model Predictive Control (BMPC), a novel algorithm that performs policy learning in a bootstrapped manner. BMPC learns a network policy by imitating an MPC expert, and in turn, uses this policy to guide the MPC process. Combined with model-based TD-learning, our policy learning yields better value estimation and further boosts the efficiency of MPC. We also introduce a lazy reanalyze mechanism, which enables computationally efficient imitation learning. Our method achieves superior performance over prior works on diverse continuous control tasks. In particular, on challenging high-dimensional locomotion tasks, BMPC significantly improves data efficiency while also enhancing asymptotic performance and training stability, with comparable training time and smaller network sizes. Code is available at https://github.com/bmpc-anonymous/bmpc.

337Exploring Diffusion Models’ Corruption Stage in Few-Shot Fine-tuning and Mitigating with Bayesian Neural Networks

[openreview] [pdf]

Abstract Few-shot fine-tuning of Diffusion Models (DMs) is a key advancement, significantly reducing training costs and enabling personalized AI applications. However, we explore the training dynamics of DMs and observe an unanticipated phenomenon: during the training process, image fidelity initially improves, then unexpectedly deteriorates with the emergence of noisy patterns, only to recover later with severe overfitting. We term the stage with generated noisy patterns the corruption stage. To understand this corruption stage, we begin by heuristically modeling the one-shot fine-tuning scenario, and then extend this modeling to more general cases. Through this modeling, we identify the primary cause of this corruption stage: a narrowed learning distribution inherent in the nature of few-shot fine-tuning. To tackle this, we apply Bayesian Neural Networks (BNNs) on DMs with variational inference to implicitly broaden the learned distribution, and show that the learning target of the BNNs can be naturally regarded as an expectation of the diffusion loss and a further regularization with the pretrained DMs. This approach is highly compatible with current few-shot fine-tuning methods in DMs and does not introduce any extra inference costs. Experimental results demonstrate that our method significantly mitigates corruption, and improves the fidelity, quality and diversity of the generated images in both object-driven and subject-driven generation tasks. The code is available at an anonymous link.

338FairCoT: Enhancing Fairness in Diffusion Models via Chain of Thought Reasoning of Multimodal Language Models

[openreview] [pdf]

Abstract In the domain of text-to-image generative models, biases inherent in training datasets often propagate into generated content, posing significant ethical challenges, particularly in socially sensitive contexts. We introduce FairCoT, a novel framework that enhances fairness in diffusion models through Chain-of-Thought (CoT) reasoning within multimodal generative large language models (LLMs). FairCoT employs iterative CoT refinement and attire-based attribute prediction to systematically mitigate biases, ensuring diverse and equitable representation in generated images. By integrating iterative reasoning processes, FairCoT addresses the limitations of zero-shot CoT in sensitive scenarios, balancing creativity with ethical responsibility. Experimental evaluations across multiple models, including DALL-E and various Stable Diffusion variants, demonstrate that FairCoT significantly improves fairness and diversity metrics without compromising image quality or relevance. Our approach advances ethical AI practices in generative modeling, promoting socially responsible content generation and setting new standards for fairness in AI-generated imagery.

339Win Rate is All that Can Matter from Preference Data Alone

[openreview] [pdf]

Abstract The surging interest in learning from preference data has resulted in an elaborate landscape of methods and evaluations. This work offers a framework to simplify this landscape. We start with the insight that the only fixed information represented in preference data is the preference classifier, and thus the only evaluation of a model grounded in the data is win rate under this classifier. In other words, win rate is all that can matter from preference data alone. This insight unlocks many follow-up results. First, we introduce a family of objectives to directly optimize for win rate, called Direct Win Rate Optimization (DWRO) objectives. We show that Reinforcement Learning From Human Feedback (RLHF) is a KL-regularized DWRO objective while SFT on preferred samples is not. We then compare the target distributions of various preference learning objectives and explain how different design choices affect the sharpness of the resulting distribution. Furthermore, we provide closed-form solutions for the expected win rate improvement of common preference learning algorithms and explain the intuitions they provide. Our analysis and accompanying experiments not only elucidate the design space of preference learning algorithms but also offer guidance on future directions to advance preference learning.
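
The central quantity is easy to state in code: given any preference classifier, a model's win rate is the mean probability that its samples beat a baseline's. The classifier below is a toy Bradley-Terry stand-in, not anything specific to the paper.

```python
# Win rate of model samples against baseline samples under a preference
# classifier p(y1 beats y2). Toy scalar "generations" for illustration.
import numpy as np

rng = np.random.default_rng(0)

def pref_prob(y1, y2):
    # Toy Bradley-Terry preference: higher scalar "quality" tends to win.
    return 1.0 / (1.0 + np.exp(-(y1 - y2)))

model_samples = rng.normal(0.5, 1.0, 1000)     # stand-in generations
baseline_samples = rng.normal(0.0, 1.0, 1000)

# Expected win rate: mean preference probability over sampled pairs.
win_rate = pref_prob(model_samples, baseline_samples).mean()
print(f"win rate vs baseline: {win_rate:.3f}")
```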

340Jump Your Steps: Optimizing Sampling Schedule of Discrete Diffusion Models

[openreview] [pdf]

Abstract Diffusion models have seen notable success in continuous domains, leading to the development of discrete diffusion models (DDMs) for discrete variables. Despite recent advances, DDMs face the challenge of slow sampling speeds. While parallel sampling methods like τ-leaping accelerate this process, they introduce Compounding Decoding Error (CDE), where discrepancies arise between the true distribution and the approximation from parallel token generation, leading to degraded sample quality. In this work, we present Jump Your Steps (JYS), a novel approach that optimizes the allocation of discrete sampling timesteps by minimizing CDE without extra computational cost. More precisely, we derive a practical upper bound on CDE and propose an efficient algorithm for searching for the optimal sampling schedule. Extensive experiments across image, music, and text generation show that JYS significantly improves sampling quality, establishing it as a versatile framework for enhancing DDM performance for fast sampling.

341Diffusion-based Prompt Generation for Lifelong Continual Adaptation

[openreview] [pdf]

Abstract Continual Test-time Adaptation (TTA) addresses sequential out-of-distribution scenarios with unlabeled data but overlooks long-term and recurring in-distribution aspects of the real world. Therefore, we introduce Lifelong Continual Adaptation, which enables models to efficiently retrieve domain-specific knowledge when encountering in-distribution data streams with sequential and recurring domains. We found that optimization-based Continual TTA methods underperform on the proposed problem due to two major pitfalls: updating the model’s parameters is expensive and impractical for resource-constrained devices, and these methods exhibit instability when adapting to long-term recurring domains. To address these challenges, we propose a diffusion-based prompt generation method (DiffPrompt). Specifically, instead of continually optimizing the foundation model, we generate domain-specific prompts for it to adapt. We use a conditional diffusion model to learn a prompt-space distribution for various domains. During testing, the diffusion model generates prompts for the current domain based on the incoming batch of data, facilitating the continual adaptation of the foundation model. Our experiments demonstrate that DiffPrompt enables stable and efficient deployment in practical scenarios involving sequential and recurring domains.

342Knowledge Localization: Mission Not Accomplished? Enter Query Localization!

[openreview] [pdf]

Abstract Large language models (LLMs) store extensive factual knowledge, but the mechanisms behind how they store and express this knowledge remain unclear. The Knowledge Neuron (KN) thesis is a prominent theory for explaining these mechanisms. This theory is based on the Knowledge Localization (KL) assumption, which suggests that a fact can be localized to a few knowledge storage units, namely knowledge neurons. However, this assumption has two limitations: first, it may be too rigid regarding knowledge storage, and second, it neglects the role of the attention module in knowledge expression. In this paper, we first re-examine the KL assumption and demonstrate that its limitations do indeed exist. To address these, we then present two new findings, each targeting one of the limitations: one focusing on knowledge storage and the other on knowledge expression. We summarize these findings as the Query Localization (QL) assumption and argue that the KL assumption can be viewed as a simplification of the QL assumption. Based on the QL assumption, we further propose the Consistency-Aware KN modification method, which improves the performance of knowledge modification, further validating our new assumption. We conduct 39 sets of experiments, along with additional visualization experiments, to rigorously confirm our conclusions. Code will be made public soon.

343Boosting Latent Diffusion with Perceptual Objectives

[openreview] [pdf]

Abstract Latent diffusion models (LDMs) power state-of-the-art high-resolution generative image models. LDMs learn the data distribution in the latent space of an autoencoder (AE) and produce images by mapping the generated latents into RGB image space using the AE decoder. While this approach allows for efficient model training and sampling, it induces a disconnect between the training of the diffusion model and the decoder, resulting in a loss of detail in the generated images. To remediate this disconnect, we propose to leverage the internal features of the decoder to define a latent perceptual loss (LPL). This loss encourages the models to create sharper and more realistic images. Our loss can be seamlessly integrated with common autoencoders used in latent diffusion models, and can be applied to different generative modeling paradigms such as DDPM with epsilon and velocity prediction, as well as flow matching. Extensive experiments with models trained on three datasets at 256 and 512 resolution show improved quantitative -- with boosts between 6% and 20% in FID -- and qualitative results when using our perceptual loss.

344Alternating Projections With Volume Sampling

[openreview] [pdf]

Abstract The method of Alternating Projections (AP) is a fundamental iterative technique with applications to problems in machine learning, optimization and signal processing. Examples include the Gauss-Seidel algorithm which is used to solve large-scale regression problems and the Kaczmarz and projections onto convex sets (POCS) algorithms that are fundamental to iterative reconstruction. Progress has been made with regard to the questions of efficiency and rate of convergence of the AP method in the randomized setting. Here, we extend these results with volume sampling to block (batch) sizes greater than 1 and provide explicit formulas that relate the convergence rate bounds to the spectrum of the underlying system. These results, together with a trace formula and associated volume sampling, prove that convergence rates monotonically improve with larger block sizes, a feature that cannot be guaranteed in general with uniform sampling (e.g., in SGD).
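
A small sketch of the setting: randomized block Kaczmarz (a special case of AP) with blocks of size k drawn by volume sampling, i.e. with probability proportional to det(A_S A_S^T). Exact enumeration of all blocks, as done here, is only feasible for tiny systems and is purely illustrative of the sampling scheme.

```python
# Block Kaczmarz for a consistent system Ax = b, with k-row blocks drawn by
# volume sampling: P(S) proportional to det(A_S A_S^T). Illustrative only.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
m, n, k = 20, 10, 3
A = rng.normal(size=(m, n))
x_true = rng.normal(size=n)
b = A @ x_true

# Enumerate all k-row blocks and their squared volumes det(A_S A_S^T).
blocks = list(combinations(range(m), k))
vols = np.array([np.linalg.det(A[list(S)] @ A[list(S)].T) for S in blocks])
probs = vols / vols.sum()

x = np.zeros(n)
for _ in range(300):
    S = list(blocks[rng.choice(len(blocks), p=probs)])
    A_S, b_S = A[S], b[S]
    # Project x onto the affine set {z : A_S z = b_S}.
    x += A_S.T @ np.linalg.solve(A_S @ A_S.T, b_S - A_S @ x)
print("residual:", np.linalg.norm(A @ x - b))
```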

345Event-Driven Online Vertical Federated Learning

[openreview] [pdf]

Abstract Online learning is more adaptable to real-world scenarios in Vertical Federated Learning (VFL) compared to offline learning. However, integrating online learning into VFL presents challenges due to the unique nature of VFL, where clients possess non-intersecting feature sets for the same sample. In real-world scenarios, the clients may not receive data streaming for the disjoint features for the same entity synchronously. Instead, the data are typically generated by an event relevant to only a subset of clients. We are the first to identify these challenges in online VFL, which have been overlooked by previous research. To address these challenges, we propose an event-driven online VFL framework. In this framework, only a subset of clients is activated during each event, while the remaining clients passively collaborate in the learning process. Furthermore, we incorporate dynamic local regret (DLR) into VFL to address the challenges posed by online learning problems with non-convex models within a non-stationary environment. We conduct a comprehensive regret analysis of our proposed framework, specifically examining the DLR under non-convex conditions with event-driven online VFL. Extensive experiments demonstrate that our proposed framework is more stable than the existing online VFL framework under non-stationary data conditions while also significantly reducing communication and computation costs.

346Improving Generalization of Meta Reinforcement Learning via Explanation

[openreview] [pdf]

Abstract Meta reinforcement learning learns a meta-prior (e.g., meta-policy) from a set of training tasks, such that the learned meta-prior can efficiently adapt to all the tasks in a task distribution. However, it has been observed in the literature that the learned meta-prior usually has imbalanced generalization, i.e., it adapts well to some tasks but adapts poorly to some other tasks. This paper aims to explain why certain tasks are poorly adapted and, more importantly, use this explanation to improve generalization. Our methodology has two parts. The first part identifies "critical" training tasks that are most important to achieve good performance on those poorly-adapted tasks. An explanation of the poor generalization is that the meta-prior does not pay enough attention to the critical training tasks. To improve generalization, the second part formulates a bi-level optimization problem where the upper level learns how to augment the critical training tasks such that the meta-prior can pay more attention to the critical tasks, and the lower level computes the meta-prior distribution corresponding to the current augmentation. We propose an algorithm to solve the bi-level optimization problem and theoretically guarantee that (1) the algorithm converges at the rate of O(1/\sqrt{K}), (2) the learned augmentation makes the meta-prior focus more on the critical training tasks, and (3) the generalization improves after the task augmentation. We use two real-world experiments and three MuJoCo experiments to show that our algorithm improves the generalization and outperforms state-of-the-art baselines.

347DiffPath: Generating Road Network based Path with Latent Diffusion Model

[openreview] [pdf]

Abstract With the increasing use of GPS technology, path data have become essential for applications such as navigation, urban planning, and traffic optimization. However, obtaining real-world path data presents challenges due to privacy concerns and the difficulty of collecting large datasets. Existing methods, including count-based and deep learning approaches, struggle with two main challenges: handling complex distributions of path segments and ensuring global coherence in generated paths. To address these, we introduce DiffPath, a path generation model based on Latent Diffusion Models (LDMs). By embedding paths into a continuous latent space and leveraging a transformer architecture, DiffPath captures both local transitions and global dependencies, ensuring the generation of realistic paths. Experimental results demonstrate that our model outperforms existing approaches in generating paths that adhere to real-world road network structures while maintaining privacy.

348Average Certified Radius is a Poor Metric for Randomized Smoothing

[openreview] [pdf]

Abstract Randomized smoothing is a popular approach for providing certified robustness guarantees against adversarial attacks, and has become a very active area of research. Over the past years, the average certified radius (ACR) has emerged as the single most important metric for comparing methods and tracking progress in the field. However, in this work, we show that ACR is an exceptionally poor metric for evaluating robustness guarantees provided by randomized smoothing. We theoretically show not only that a trivial classifier can have arbitrarily large ACR, but also that ACR is much more sensitive to improvements on easy samples than on hard ones. Empirically, we confirm that existing training strategies that improve ACR reduce the model’s robustness on hard samples. Further, we show that by focusing on easy samples, we can effectively replicate the increase in ACR. We develop strategies, including explicitly discarding hard samples, reweighing the dataset with certified radius, and extreme optimization for easy samples, to achieve state-of-the-art ACR, although these strategies ignore robustness for the general data distribution. Overall, our results suggest that ACR has introduced a strong undesired bias to the field, and better metrics are required to holistically evaluate randomized smoothing.
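
For reference, ACR is computed as below: each correctly classified point contributes a certified radius of σ·Φ⁻¹(p), and misclassified points contribute zero. The toy numbers illustrate the paper's point that a handful of very easy samples can dominate the average even when many samples have no robustness at all.

```python
# Average certified radius (ACR) under randomized smoothing. The skewed
# split scores higher than the balanced one despite failing on half the data.
import numpy as np
from scipy.stats import norm

sigma = 0.5

def acr(p_correct):
    """p_correct: per-sample probability the smoothed classifier is right."""
    p = np.clip(np.asarray(p_correct), 1e-9, 1 - 1e-9)
    radii = np.where(p > 0.5, sigma * norm.ppf(p), 0.0)
    return radii.mean()

balanced = [0.7] * 10                     # moderately robust everywhere
skewed = [0.9999] * 5 + [0.4] * 5         # very easy half, failing half
print(f"balanced ACR: {acr(balanced):.3f}")
print(f"skewed   ACR: {acr(skewed):.3f}")  # higher, despite 5 failures
```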

349Ensemble Kalman Diffusion Guidance: A Derivative-free Method for Inverse Problems

[openreview] [pdf]

Abstract When solving inverse problems, it is increasingly popular to use pre-trained diffusion models as plug-and-play priors. This framework can accommodate different forward models without re-training while preserving the generative capability of diffusion models. Despite their success in many imaging inverse problems, most existing methods rely on privileged information such as derivatives, pseudo-inverses, or full knowledge of the forward model. This reliance poses a substantial limitation that restricts their use in a wide range of problems where such information is unavailable, such as many scientific applications. To address this, we propose Ensemble Kalman Diffusion Guidance (EnKG) for diffusion models, a derivative-free approach that can solve inverse problems by only accessing forward model evaluations and a pre-trained diffusion model. We study the empirical effectiveness of our method across various inverse problems, including scientific settings such as inferring fluid flows and astronomical objects, which are highly non-linear inverse problems that often only permit black-box access to the forward model.

350Improved Sampling Algorithms for Lévy-Itô Diffusion Models

[openreview] [pdf]

Abstract Lévy-Itô denoising diffusion models relying on isotropic α-stable noise instead of Gaussian distribution have recently been shown to improve performance of conventional diffusion models in image generation on imbalanced datasets while performing comparably in the standard settings. However, the stochastic algorithm of sampling from such models consists in solving the stochastic differential equation describing only an approximate inverse of the process of adding α-stable noise to data which may lead to suboptimal performance. In this paper, we derive a parametric family of stochastic differential equations whose solutions have the same marginal densities as those of the forward diffusion and show that the appropriate choice of the parameter values can improve quality of the generated images when the number of reverse diffusion steps is small. Also, we demonstrate that Lévy-Itô diffusion models are applicable to diverse domains and show that a well-trained text-to-speech Lévy-Itô model may have advantages over standard diffusion models on highly imbalanced datasets.

351Fast and Slow Streams for Online Time Series Forecasting Without Information Leakage

[openreview] [pdf]

Abstract Current research in online time series forecasting suffers from information leakage: models predict and then evaluate on historical time steps that have been backpropagated for parameter updates. This setting also misaligns with the real-world conception of forecasting, which typically emphasizes looking ahead and anticipating future uncertainties. This paper redefines online time series forecasting to focus on predicting unknown future steps and evaluates performance solely based on these predictions. Following this new setting, challenges arise in leveraging incomplete pairs of ground truth and prediction for backpropagation, as well as generalizing accurate information without overfitting to noises from recent data streams. To address these challenges, we propose a novel dual-stream framework for online forecasting (DSOF): a slow stream that updates with complete data using experience replay, and a fast stream that adapts to recent data through temporal difference learning. This dual-stream approach updates a teacher-student model learned through a residual learning strategy, generating predictions in a coarse-to-fine manner. Extensive experiments demonstrate its improvement in forecasting performance in changing environments.

352Flexible Fairness-Aware Learning via Inverse Conditional Permutation

[openreview] [pdf]

Abstract Equalized odds, as a popular notion of algorithmic fairness, aims to ensure that sensitive variables, such as race and gender, do not unfairly influence the algorithm’s prediction when conditioning on the true outcome. Despite rapid advancements, current research primarily focuses on equalized odds violations caused by a single sensitive attribute, leaving the challenge of simultaneously accounting for multiple attributes largely unaddressed. We bridge this gap by introducing an in-processing fairness-aware learning approach, FairICP, which integrates adversarial learning with a novel inverse conditional permutation scheme. FairICP offers a theoretically justified, flexible, and efficient scheme to promote equalized odds under fairness conditions described by complex and multi-dimensional sensitive attributes. The efficacy and adaptability of our method are demonstrated through both simulation studies and empirical analyses of real-world datasets.

353ASOR: Anchor State Oriented Regularization for Policy Optimization under Dynamics Shift

[openreview] [pdf]

Abstract To train neural policies in environments with diverse dynamics, Imitation from Observation (IfO) approaches aim at recovering expert state trajectories. Their success is built upon the assumption that the stationary state distributions induced by optimal policies remain similar despite dynamics shift. However, such an assumption does not hold in many real-world scenarios, especially when certain states become inaccessible during environment dynamics change. In this paper, we propose the concept of anchor states which appear in all optimal trajectories under dynamics shift, thereby maintaining consistent state accessibility. Instead of direct imitation, we incorporate anchor state distributions into policy regularization to mitigate the issue of inaccessible states, leading to the ASOR algorithm. By formally characterizing the difference of state accessibility under dynamics shift, we show that the anchor state-based regularization approach provides strong lower-bound performance guarantees for efficient policy optimization. We perform extensive experiments across various online and offline RL benchmarks, including Gridworld, MuJoCo, MetaDrive, D4RL, and a Fall Guys-like game environment, featuring multiple sources of dynamics shift. Experimental results indicate ASOR can be effectively integrated with several state-of-the-art cross-domain policy transfer algorithms, substantially enhancing their performance.

354Imputation for prediction: beware of diminishing returns.

[openreview] [pdf]

Abstract Missing values are prevalent across various fields, posing challenges for training and deploying predictive models. In this context, imputation is a common practice, driven by the hope that accurate imputations will enhance predictions. However, recent theoretical and empirical studies indicate that simple constant imputation can be consistent and competitive. This empirical study aims at clarifying if and when investing in advanced imputation methods yields significantly better predictions. Relating imputation and predictive accuracies across combinations of imputation and predictive models on 19 datasets, we show that imputation accuracy matters less i) when using expressive models and ii) when incorporating missingness indicators as complementary inputs, and iii) that it matters much more for generated linear outcomes than for real-data outcomes. Interestingly, we also show that the use of the missingness indicator is beneficial to prediction performance, even in MCAR scenarios. Overall, on real data with powerful models, imputation quality has only a minor effect on prediction performance. Thus, investing in better imputations for improved predictions often offers limited benefits.
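
The competitive simple baseline the study describes is a one-liner in scikit-learn: constant imputation with an appended missingness indicator, feeding an expressive model. The data below is synthetic MCAR, for illustration only.

```python
# Constant imputation plus a missingness indicator: with an expressive model,
# this simple pipeline is often competitive with sophisticated imputers.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, 500)
X[rng.random(X.shape) < 0.3] = np.nan   # 30% MCAR missingness

# add_indicator=True appends one binary column per feature with missing values.
model = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0.0, add_indicator=True),
    HistGradientBoostingRegressor(random_state=0),
)
print("R^2:", cross_val_score(model, X, y, cv=5).mean())
```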

355DSPO: Direct Score Preference Optimization for Diffusion Model Alignment

[openreview] [pdf]

Abstract Diffusion-based Text-to-Image (T2I) models have achieved impressive success in generating high-quality images from textual prompts. While large language models (LLMs) effectively leverage Direct Preference Optimization (DPO) for fine-tuning on human preference data without the need for reward models, diffusion models have not been extensively explored in this area. Current preference learning methods applied to T2I diffusion models directly adapt existing techniques from LLMs. However, this adaptation introduces a mismatch between the pretraining and the fine-tuning objectives specific to T2I diffusion models. This inconsistency can potentially lead to suboptimal performance. In this work, we propose Direct Score Preference Optimization (DSPO), a novel algorithm that aligns the pretraining and fine-tuning objectives of diffusion models by leveraging score matching, the same objective used during pretraining. It introduces a new perspective on preference learning for diffusion models. Specifically, DSPO distills the score function of human-preferred image distributions into pretrained diffusion models, fine-tuning the model to generate outputs that align with human preferences. We theoretically show that DSPO shares the same optimization direction as reinforcement learning algorithms in diffusion models under certain conditions. Our experimental results demonstrate that DSPO outperforms preference learning baselines for T2I diffusion models in human preference evaluation tasks and enhances both visual appeal and prompt alignment of generated images.

356Exploring Local Memorization in Diffusion Models via Bright Ending Attention

[openreview] [pdf]

Abstract In this paper, we identify and leverage a novel ‘bright ending’ (BE) anomaly in diffusion models prone to memorizing training images to address a new task: locating localized memorization regions within these models. BE refers to a distinct cross-attention pattern observed in text-to-image generations using diffusion models. Specifically, memorized image patches exhibit significantly greater attention to the end token during the final inference step compared to non-memorized patches. This attention map effectively highlights regions where the generated image replicates training data. Furthermore, driven by our observation that local memorization significantly underperforms in existing tasks of measuring, detecting, and mitigating memorization in diffusion models compared to global memorization, we propose a simple yet effective method to integrate BE and the results of the new localization task into these existing frameworks. This integration effectively improves their performances by narrowing the performance gap caused by local memorization. Our results not only demonstrate the successful execution of the new localization task but also establish new state-of-the-art performance across all existing tasks, underscoring the significance of the BE phenomenon.
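
A schematic of the detection heuristic, run on a dummy attention map: compare each patch's final-step attention to the prompt's end token against the other patches and flag outliers. The thresholding rule below is an assumption for illustration, not the paper's exact criterion.

```python
# Bright-ending heuristic sketch: patches with unusually high final-step
# cross-attention on the end token are flagged as possibly memorized.
# A random tensor stands in for a real cross-attention map.
import torch

torch.manual_seed(0)
# (num_patches, num_text_tokens) attention at the last inference step.
attn = torch.rand(64, 77)
attn = attn / attn.sum(dim=-1, keepdim=True)  # normalize over text tokens
end_token_attn = attn[:, -1]                  # attention mass on the end token

# Illustrative outlier rule: mean + 2 standard deviations across patches.
threshold = end_token_attn.mean() + 2 * end_token_attn.std()
flagged = (end_token_attn > threshold).nonzero().flatten()
print("patches flagged as possibly memorized:", flagged.tolist())
```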

357Backdoor Attacks for LLMs with Weak-To-Strong Knowledge Distillation

[openreview] [pdf]

Abstract Despite being widely applied due to their exceptional capabilities, Large Language Models (LLMs) have been proven to be vulnerable to backdoor attacks. These attacks introduce targeted vulnerabilities into LLMs by poisoning training samples and full-parameter fine-tuning. However, such attacks are limited because they require significant computational resources, especially as the size of LLMs increases. Parameter-efficient fine-tuning (PEFT) offers an alternative, but its restricted parameter updates may impede the alignment of triggers with target labels. In this study, we first verify that backdoor attacks with PEFT may encounter challenges in achieving feasible performance. To address these issues and improve the effectiveness of backdoor attacks with PEFT, we propose a novel weak-to-strong backdoor attack algorithm based on feature alignment-enhanced knowledge distillation (W2SAttack). Specifically, we poison small-scale language models through full-parameter fine-tuning to serve as the teacher model. The teacher model then covertly transfers the backdoor to the large-scale student model through feature alignment-enhanced knowledge distillation, which employs PEFT. Theoretical analysis reveals that W2SAttack has the potential to augment the effectiveness of backdoor attacks. We demonstrate the superior performance of W2SAttack on classification tasks across four language models, four backdoor attack algorithms, and two different architectures of teacher models. Experimental results indicate success rates close to 100% for backdoor attacks targeting PEFT.

358Stochastic Diffusion: A Diffusion Based Model for Stochastic Time Series Forecasting

[openreview] [pdf]

Abstract Recent successes in diffusion probabilistic models have demonstrated their strength in modelling and generating different types of data, paving the way for their application in generative time series forecasting. However, most existing diffusion-based approaches rely on sequential models and unimodal latent variables to capture global dependencies and model entire observable data, resulting in difficulties when it comes to highly stochastic time series data. In this paper, we propose a novel Stochastic Diffusion (StochDiff) model that integrates the diffusion process into the time series modelling stage and utilizes the representational power of stochastic latent spaces to capture the variability of highly stochastic time series data. Specifically, the model applies a diffusion module at each time step within the sequential framework and learns a step-wise, data-driven prior for the generative diffusion process. These features enable the model to effectively capture complex temporal dynamics and the multi-modal nature of highly stochastic time series data. Through extensive experiments on real-world datasets, we demonstrate the effectiveness of our proposed model for probabilistic time series forecasting, particularly in scenarios with high stochasticity. Additionally, with a real-world surgical use case, we highlight the model’s potential in medical applications.

359A Modified Proximal-Perturbed Lagrangian for Non-Convex Non-Smooth Representatives of Fairness Constraints

[openreview] [pdf]

Abstract We study classification problems under fairness constraints and introduce an algorithmic framework designed to prevent discrimination against different groups. These problems are often reformulated as continuous constrained optimization problems and are typically solved using continuous relaxations (surrogates) of the fairness constraints. However, many current algorithms do not provide theoretical guarantees, possibly because the resulting fairness constraints are both non-convex and non-smooth. We propose a novel primal-dual algorithm, based on a newly developed Lagrangian, that converges to a stationary solution of the reformulated problem. Our algorithm is not only efficient and robust, but it also enjoys strong performance guarantees on the fairness of its solutions. Furthermore, experimental results demonstrate that our algorithm is highly effective in terms of computational cost and fairness guarantees, outperforming related algorithms that use regularization (penalization) techniques and/or standard Lagrangian relaxation.

360Adaptive Source Localization on Complex Networks via Conditional Diffusion Model

[openreview] [pdf]

Abstract Network propagation issues like the spread of misinformation, cyber threats, or infrastructure breakdowns are prevalent and have significant societal impacts. Identifying the source of such propagation by analyzing snapshots of affected networks is crucial for managing crises like disease outbreaks and enhancing network security. Traditional methods rely on metrics derived from network topology and are limited to specific propagation models, while deep learning models face the challenge of data scarcity. We propose ASLDiff (Adaptive Source Localization Diffusion Model), a novel adaptive source localization diffusion model that achieves accurate and robust source localization across different network topologies and propagation modes by fusing the principles of information propagation and restructuring the label propagation process within the conditioning module. Our approach not only adapts easily to real-world patterns without abundant fine-tuning data but also generalizes to different network topologies. Evaluations on various datasets demonstrate ASLDiff’s superior effectiveness, accuracy, and adaptability in real-world applications, showcasing its robust performance across different localization scenarios. The code can be found at https://anonymous.4open.science/r/ASLDiff-4FE0.

361UTSD: Unified Time Series Diffusion Model

[openreview] [pdf]

Abstract Transformer-based architectures have achieved unprecedented success in time series analysis. However, when facing the challenge of cross-domain modeling, existing approaches that utilize statistical priors as prompt engineering fail under the large distribution shifts among various domains. In this paper, a Unified Time Series Diffusion (UTSD) model is established for the first time to model the multi-domain probability distribution, utilizing the powerful distribution modeling ability of diffusion models. Unlike autoregressive models that capture the conditional probabilities of the prediction horizon given the historical sequence, we use a diffusion denoising process to model the mixture distribution of the cross-domain data and generate the prediction sequence for the target domain directly via conditional sampling. The proposed UTSD contains three pivotal designs: (1) a condition network that captures the multi-scale fluctuation patterns from the observation sequence, which are utilized as context representations to guide the denoising network in generating the prediction sequence; (2) an adaptor-based fine-tuning strategy, in which the multi-domain universal representation learned in the pretraining stage is utilized for downstream tasks in the target domains; and (3) a diffusion and denoising process on the actual sequence space, combined with improved classifier-free guidance as the conditional generation strategy, which greatly improves the stability and accuracy of the downstream task. We conduct extensive experiments on mainstream benchmarks, and the pre-trained UTSD outperforms existing foundation models on all data domains, exhibiting superior zero-shot generalization ability. After training from scratch, UTSD achieves comparable performance against domain-specific proprietary models. In particular, UTSD shows stable and reliable time series generation, and the empirical results validate the potential of UTSD as a time series foundation model. The source code of UTSD is publicly available at https://anonymous.4open.science/r/UTSD-1BFF.

362ContraDiff: Planning Towards High Return States via Contrastive Learning

[openreview] [pdf]

Abstract The performance of offline reinforcement learning (RL) is sensitive to the proportion of high-return trajectories in the offline dataset. However, in many simulation environments and real-world scenarios, there are large ratios of low-return trajectories rather than high-return trajectories, which makes learning an efficient policy challenging. In this paper, we propose a method called Contrastive Diffuser (ContraDiff) to make full use of low-return trajectories and improve the performance of offline RL algorithms. Specifically, ContraDiff groups the states of trajectories in the offline dataset into high-return states and low-return states and treats them as positive and negative samples, respectively. Then, it designs a contrastive mechanism to pull the planned trajectory of an agent toward high-return states and push it away from low-return states. Through this contrast mechanism, trajectories with low returns serve as negative examples that provide a "counteracting force", guiding the agent to avoid areas associated with low returns and achieve better performance. Experiments on 27 sub-optimal datasets demonstrate the effectiveness of our proposed method. Our code is publicly available at https://anonymous.4open.science/r/ContraDiff.
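
A triplet-style stand-in for the pull/push mechanism: penalize planned states that sit closer to low-return anchor states than to high-return ones. The exact loss in the paper may differ; this only illustrates the contrast idea on random tensors.

```python
# Contrastive pull/push sketch for planned trajectories: pull toward nearby
# high-return states, push away from low-return ones. Illustrative loss only.
import torch

def contrastive_plan_loss(plan, pos_states, neg_states, tau=1.0):
    """plan: (T, d) planned states; pos/neg: (N, d) anchor states."""
    d_pos = torch.cdist(plan, pos_states).min(dim=1).values  # nearest positive
    d_neg = torch.cdist(plan, neg_states).min(dim=1).values  # nearest negative
    # Small when the plan hugs high-return states and avoids low-return ones.
    return torch.relu(d_pos - d_neg + tau).mean()

torch.manual_seed(0)
plan = torch.randn(16, 4, requires_grad=True)
loss = contrastive_plan_loss(plan,
                             torch.randn(50, 4) + 2.0,   # high-return anchors
                             torch.randn(50, 4) - 2.0)   # low-return anchors
loss.backward()  # this gradient would steer the planner during training
print(loss.item(), plan.grad.norm().item())
```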

363DiffuSolve: Diffusion-Based Solver for Non-Convex Trajectory Optimization

[openreview] [pdf]

Abstract Optimal trajectory design is computationally expensive for nonlinear and high-dimensional dynamical systems. The challenge arises from the non-convex nature of the optimization problem with multiple local optima, which usually requires a global search. Traditional numerical solvers struggle to find diverse solutions efficiently without appropriate initial guesses. In this paper, we introduce DiffuSolve, a general diffusion model-based solver for non-convex trajectory optimization. An expressive diffusion model is trained on pre-collected locally optimal solutions and efficiently samples initial guesses, which then warm-starts numerical solvers to fine-tune the feasibility and optimality. We also present DiffuSolve+, a novel constrained diffusion model with an additional loss in training that further reduces the problem constraint violations of diffusion samples. Experimental evaluations on three tasks verify the improved robustness, diversity, and a 2× to 11× increase in computational efficiency with our proposed method, which generalizes well to trajectory optimization problems of varying challenges.

364f-Divergence Policy Optimization in Fully Decentralized Cooperative MARL

[openreview] [pdf]

Abstract Independent learning is a straightforward solution for fully decentralized learning in cooperative multi-agent reinforcement learning (MARL). The study of independent learning has a history of decades, and representative methods, such as independent Q-learning and independent PPO, can obtain good performance in some benchmarks. However, most independent learning algorithms lack convergence guarantees or theoretical support. In this paper, we propose a general formulation of independent policy optimization, f-divergence policy optimization. We show the generality of such a formulation and analyze its limitations. Based on this formulation, we further propose a novel independent learning algorithm, TVPO, that theoretically guarantees convergence. Empirically, we show that TVPO outperforms state-of-the-art fully decentralized learning methods in three popular cooperative MARL benchmarks, which verifies the efficacy of TVPO.

365Generalization in VAE and Diffusion Models: A Unified Information-Theoretic Analysis

[openreview] [pdf]

Abstract Despite the empirical success of Diffusion Models (DMs) and Variational Autoencoders (VAEs), their generalization performance remains theoretically underexplored, particularly lacking a full consideration of the shared encoder-generator structure. Leveraging recent information-theoretic tools, we propose a unified theoretical framework that guarantees the generalization of both the encoder and generator by treating them as randomized mappings. This framework further enables (1) a refined analysis for VAEs, accounting for the generator’s generalization, which was previously overlooked; (2) an explicit trade-off in generalization terms for DMs that depends on the diffusion time $T$; and (3) estimable bounds for DMs based solely on the training data, allowing the selection of the optimal $T$ and the integration of such bounds into the optimization process to improve model performance. Empirical results on both synthetic and real datasets illustrate the validity of the proposed theory.

366Right Time to Learn: Promoting Generalization via Bio-inspired Spacing Effect in Knowledge Distillation

[openreview] [pdf]

Abstract Knowledge distillation (KD) is a powerful strategy for training deep neural networks (DNNs). While it was originally proposed to train a more compact “student” model from a large “teacher” model, many recent efforts have focused on adapting it to promote generalization of the model itself, as in online KD and self KD. Here, we propose an easy-to-use and compatible strategy named Spaced KD to improve the effectiveness of both online KD and self KD, in which the student model distills knowledge from a teacher model trained a space interval ahead. This strategy is inspired by the spacing effect, a prominent theory in biological learning and memory positing that appropriate intervals between learning trials can significantly enhance learning performance. We provide an in-depth theoretical and empirical analysis showing that the benefits of the proposed spacing effect in KD stem from seeking flat minima during stochastic gradient descent (SGD). We perform extensive experiments to demonstrate the effectiveness of our Spaced KD in improving the learning performance of DNNs (e.g., the additional performance gain is up to 2.31% and 3.34% on Tiny-ImageNet over online KD and self KD, respectively).
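A minimal sketch of the spaced schedule, assuming a plain classification setup: the teacher is a clone of the student trained `interval` steps further, so the student always distills from a slightly more-trained copy of itself. Function names and the loss mix are assumptions.

```python
# Hedged sketch of Spaced KD's "teacher trained a space interval ahead".
import copy
import torch
import torch.nn.functional as F

def refresh_teacher(student, loader, interval, lr=1e-3):
    """Clone the student and train the clone `interval` extra steps; the clone
    then serves as the spaced teacher until the next refresh."""
    teacher = copy.deepcopy(student)
    opt = torch.optim.SGD(teacher.parameters(), lr=lr)
    for i, (x, y) in enumerate(loader):
        if i >= interval:
            break
        opt.zero_grad()
        F.cross_entropy(teacher(x), y).backward()
        opt.step()
    return teacher.eval()

def spaced_kd_step(student, teacher, batch, opt, alpha=0.5, T=4.0):
    x, y = batch
    s_logits = student(x)
    with torch.no_grad():
        t_logits = teacher(x)
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    loss = alpha * kd + (1 - alpha) * F.cross_entropy(s_logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```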

367Combating inherent noise for direct preference optimization

[openreview] [pdf]

Abstract Direct Preference Optimization (DPO) has recently gained traction as a promising approach to align large models with human feedback. It is notable for its effectiveness and ease of application across various models, including Large Language Models (LLMs) and Diffusion Models (DMs). However, the quality of preference data used in DPO training has been largely overlooked. Current datasets, whether annotated by deep learning metrics or crowd-sourced human judgments, often contain noisy labels. This noise can adversely affect the performance of DPO. To address this issue, we propose a novel approach that incorporates a noise-aware metric into the DPO objective. This metric, which includes intra-annotator confidence and inter-annotator stability, helps identify and mitigate the impact of noisy data. We introduce an Adaptive-DPO loss function that improves the DPO loss in two ways: it reduces the influence of noisy samples and amplifies the impact of clean ones. Our experiments demonstrate that this method effectively handles both synthetic and natural noisy data, leading to improved performance in visual and textual generation tasks. This underscores the practical value of our approach in enhancing model robustness amidst noisy preference data.
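The two-way reweighting can be sketched as a per-sample weight on the standard DPO term. The snippet below is a hedged illustration; the paper's exact weighting built from intra-annotator confidence and inter-annotator stability may differ.

```python
# Hedged sketch of a noise-aware DPO loss: w in [0, 1] downweights likely-noisy
# preference pairs and amplifies clean ones. The weight construction is assumed.
import torch
import torch.nn.functional as F

def adaptive_dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
                      w, beta=0.1):
    """Inputs are summed log-probs of chosen/rejected responses under the
    policy and the frozen reference model; w is a (B,) noise-aware weight."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    per_sample = -F.logsigmoid(margin)        # standard DPO objective per pair
    return (w * per_sample).mean()            # clean pairs count more
```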

368Attaining Human’s Desirable Outcomes in Indirect Human-AI Interaction via Multi-Agent Influence Diagrams

[openreview] [pdf]

Abstract In human-AI interaction, one of the cutting-edge research questions is how AI agents can assist a human to attain their desirable outcomes. Most related work investigated the paradigm where a human is required to physically interact with AI agents, which we call direct human-AI interaction. However, this paradigm is inapplicable when the scenarios are hazardous to humans, such as mine rescue and recovery. To alleviate this shortcoming, we consider indirect human-AI interaction in this paper. More specifically, a human relies on additional AI agents, which we call AI proxies, to interact with other AI agents and attain the human’s desirable outcomes. We model this interactive process as multi-agent influence diagrams (MAIDs), an augmentation of Bayesian networks to describe games, with Nash equilibrium (NE) as a solution. Nonetheless, in a MAID there may exist multiple NEs, and only one NE is associated with a human’s desirable outcomes. To reach this optimal NE, we propose pre-strategy intervention, an action that provides AI proxies with more information to make decisions toward a human’s desirable outcomes. Furthermore, we demonstrate that a team reward Markov game can be rendered as a MAID. This connection not only interprets the successes and failures of prevailing multi-agent reinforcement learning (MARL) paradigms, but also underpins the implementation of pre-strategy intervention in MARL. In practice, we incorporate pre-strategy intervention into MARL for the team reward Markov game to model scenarios where all agents are required to achieve a common goal, with a subset of agents working as AI proxies to attain a human’s desirable outcomes. During training, these AI proxies receive an additional reward encoding the human’s desirable outcomes, whose feasibility is justified in theory. We evaluate the resulting algorithm ProxyAgent in benchmark MARL environments for teamwork, with additional goals as a human’s desirable outcomes.

369ET-SEED: EFFICIENT TRAJECTORY-LEVEL SE(3) EQUIVARIANT DIFFUSION POLICY

[openreview] [pdf]

Abstract Imitation learning, e.g., diffusion policy, has been proven effective in various robotic manipulation tasks. However, extensive demonstrations are required for policy robustness and generalization. To reduce the demonstration reliance, we leverage spatial symmetry and propose ET-SEED, an efficient trajectory-level SE(3) equivariant diffusion model for generating action sequences in complex robot manipulation tasks. However, previous equivariant diffusion models require per-step equivariance in the Markov process, making it difficult to learn a policy under such strong constraints. We theoretically extend equivariant Markov kernels and simplify the condition for an equivariant diffusion process, thereby significantly improving training efficiency for trajectory-level SE(3) equivariant diffusion policy in an end-to-end manner. We evaluate ET-SEED on representative robotic manipulation tasks involving rigid-body, articulated, and deformable objects. Experiments demonstrate the superior data efficiency and manipulation proficiency of our proposed method, as well as its ability to generalize to unseen configurations with only a few demonstrations. Website: https://et-seed.github.io/

370Expand and Compress: Exploring Tuning Principles for Continual Spatio-Temporal Graph Forecasting

[openreview] [pdf]

Abstract The widespread deployment of sensing devices leads to a surge in data for spatio-temporal forecasting applications such as traffic flow, air quality, and wind energy. Although spatio-temporal graph neural networks (STGNNs) have achieved success in modeling various static spatio-temporal forecasting scenarios, real-world spatio-temporal data are typically received in a streaming manner, and the network continuously expands with the installation of new sensors. Thus, spatio-temporal forecasting in streaming scenarios faces dual challenges: the inefficiency of retraining models over newly-arrived data and the detrimental effects of catastrophic forgetting over long-term history. To address these challenges, we propose a novel prompt tuning-based continuous forecasting method, EAC, following two fundamental tuning principles guided by empirical and theoretical analysis: expand and compress, which effectively resolve the aforementioned problems with lightweight tuning parameters. Specifically, we integrate the base STGNN with a continuous prompt pool, utilizing stored prompts (i.e., a few learnable parameters) in memory, and jointly optimize them with the base STGNN. This method ensures that the model sequentially learns from the spatio-temporal data stream to accomplish tasks for corresponding periods. Extensive experimental results on multiple real-world datasets demonstrate the multi-faceted superiority of EAC over the state-of-the-art baselines, including effectiveness, efficiency, universality, etc.

371Regret-Optimal List Replicable Bandit Learning: Matching Upper and Lower Bounds

[openreview] [pdf]

Abstract This paper investigates list replicability [Dixon et al., 2023] in the context of multi-armed (also linear) bandits (MAB). We define an algorithm $A$ for MAB to be $(\ell,\delta)$-list replicable if with probability at least $1-\delta$, $A$ has at most $\ell$ traces in independent executions even with different random bits, where a trace means the sequence of arms played during an execution. For $k$-armed bandits, although the total number of traces can be $\Omega(k^T)$ for a time horizon $T$, we present several surprising upper bounds that are either independent of or logarithmic in $T$: (1) a $(2^{k},\delta)$-list replicable algorithm with near-optimal regret, $\widetilde{O}(\sqrt{kT})$, (2) a $(O(k/\delta),\delta)$-list replicable algorithm with regret $\widetilde{O}\left(\frac{k}{\delta}\sqrt{kT}\right)$, (3) a $((k+1)^{B-1}, \delta)$-list replicable algorithm with regret $\widetilde{O}(k^{\frac{3}{2}}T^{\frac{1}{2}+2^{-\Omega(B)}})$ for any integer $B>1$. We show that result (3) is nearly tight by establishing that there is no $(k-1,\delta)$-list replicable algorithm with $o(T)$-regret, almost exactly matching the $k$-list replicable upper bound for $B=2$. We further show that for linear bandits with $d$-dimensional features, there is a $\widetilde{O}(d^2T^{1/2+2^{-\Omega(B)}})$-regret algorithm with $((2d+1)^{B-1},\delta)$-list replicability, for $B>1$, even when the number of possible arms can be infinite.

372Diffusion Transformer Captures Spatial-Temporal Dependencies: A Theory for Gaussian Process Data

[openreview] [pdf]

Abstract Diffusion Transformer, the backbone of Sora for video generation, successfully scales the capacity of diffusion models, pioneering new avenues for high-fidelity sequential data generation. Unlike static data such as images, sequential data consists of consecutive data frames indexed by time, exhibiting rich spatial and temporal dependencies. These dependencies represent the underlying dynamic model and are critical to validate the generated data. In this paper, we make the first theoretical step towards bridging diffusion transformers for capturing spatial-temporal dependencies. Specifically, we establish score approximation and distribution estimation guarantees of diffusion transformers for learning Gaussian process data with covariance functions of various decay patterns. We highlight how the spatial-temporal dependencies are captured and affect learning efficiency. Our study proposes a novel transformer approximation theory, where the transformer acts to unroll an algorithm. We support our theoretical results by numerical experiments, providing strong evidence that spatial-temporal dependencies are captured within attention layers, aligning with our approximation theory.

373Inv-PnCO: Invariant Predict-and-Combinatorial Optimization under Distribution Shifts

[openreview] [pdf]

Abstract Machine learning has been well introduced to solve combinatorial optimization (CO) problems over the past decade, while most works only consider the deterministic setting. Yet in real-world applications, decisions often have to be made in uncertain environments, which is typically reflected by the stochasticity of the coefficients of the problem at hand, considered as a special case of the more general and emerging “predict-and-optimize” (PnO) paradigm in the sense that the prediction and optimization are jointly learned and performed. In this paper, we consider the problem of learning to solve CO under the above uncertain setting and formulate it as “predict-and-combinatorial optimization” (PnCO), particularly in a challenging yet practical out-of-distribution (OOD) setting, where there is a distribution shift between training and testing CO instances. We propose the Invariant Predict-and-Combinatorial Optimization (Inv-PnCO) framework to alleviate this challenge. Inv-PnCO derives a learning objective that reduces the distance between the distribution of solutions and the true distribution, and uses a regularization term to learn invariant decision-oriented factors that are stable under various environments, thereby enhancing the generalizability of predictions and subsequent optimizations. We also provide a theoretical analysis of how the proposed loss reduces OOD error. The empirical evaluation across three distinct tasks on knapsack, visual shortest path planning, and the traveling salesman problem, covering array, image, and graph inputs, underscores the efficacy of Inv-PnCO in enhancing generalizability, both for predict-then-optimize and predict-and-optimize approaches.

374A Causal Lens for Learning Long-term Fair Policies

[openreview] [pdf]

Abstract Fairness-aware learning studies the development of algorithms that avoid discriminatory decision outcomes despite biased training data. While most studies have concentrated on immediate bias in static contexts, this paper highlights the importance of investigating long-term fairness in dynamic decision-making systems while simultaneously considering instantaneous fairness requirements. In the context of reinforcement learning, we propose a general framework where long-term fairness is measured by the difference in the average expected qualification gain that individuals from different groups could obtain. Then, through a causal lens, we decompose this metric into three components that represent the direct impact, the delayed impact, as well as the spurious effect the policy has on the qualification gain. We analyze the intrinsic connection between these components and an emerging fairness notion called benefit fairness that aims to control the equity of outcomes in decision-making. Finally, we develop a simple yet effective approach for balancing various fairness notions.

375VideoGuide: Improving Video Diffusion Models without Training Through a Teacher’s Guide

[openreview] [pdf]

Abstract Text-to-image (T2I) diffusion models have revolutionized visual content creation, but extending these capabilities to text-to-video (T2V) generation remains a challenge, particularly in preserving temporal consistency. Existing methods that aim to improve consistency often cause trade-offs such as reduced imaging quality and impractical computational time. To address these issues, we introduce VideoGuide, a novel framework that enhances the temporal consistency of pretrained T2V models without the need for additional training or fine-tuning. Instead, VideoGuide leverages any pretrained video diffusion model (VDM) or itself as a guide during the early stages of inference, improving temporal quality by interpolating the guiding model’s denoised samples into the sampling model’s denoising process. The proposed method brings about significant improvement in temporal consistency and image fidelity, providing a cost-effective and practical solution that synergizes the strengths of various video diffusion models. Furthermore, we demonstrate prior distillation, revealing that base models can achieve enhanced text coherence by utilizing the superior data prior of the guiding model through the proposed method. Project Page: https://videoguide2025.github.io/
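The interpolation step can be sketched compactly. The loop below blends one-step outputs of the sampling model and the guiding model during an early fraction of inference; `step_fn` and `guide_step_fn` are hypothetical denoising hooks, and blending one-step outputs (rather than the paper's denoised-sample interpolation with re-noising) is a simplification.

```python
# Hedged sketch of VideoGuide-style early-stage guidance during sampling.
def guided_sampling(x, step_fn, guide_step_fn, timesteps,
                    guide_frac=0.5, lam=0.3):
    """step_fn / guide_step_fn: map the latent x_t at time t to x_{t-1} under
    the sampling model and the guiding model, respectively."""
    cutoff = int(len(timesteps) * guide_frac)
    for i, t in enumerate(timesteps):
        x_next = step_fn(x, t)
        if i < cutoff:                         # guide only the early steps
            x_next = (1 - lam) * x_next + lam * guide_step_fn(x, t)
        x = x_next
    return x
```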

376A Defense of One-Step Learning: Examining Single-Batch Distillations

[openreview] [pdf]

Abstract Dataset distillation produces a compressed synthetic dataset that approximates a large dataset or other learning task. A model can be trained on a distillation in a single gradient descent step. Conventional wisdom suggests that single-step learning is not generalizable and should yield poor performance; yet, distillation defies these expectations with good approximations of full direct-task training for a large distribution of models. In order to understand how distilled datasets can perform one-shot learning, we examine the distilled data instances and the cost surfaces produced by the distilled datasets. We demonstrate that the distilled dataset not only mimics features of the true dataset but also produces cost surfaces such that one-step training leads models from the initialization space into local minima of the true task’s cost surface. This shows how one-step learning’s counter-intuitive success is not only reasonable but also the expected outcome of dataset distillation.

377Learning by Causality to Improve Channel Dependency Modeling in Multivariate Time Series Forecasting

[openreview] [pdf]

Abstract Beyond conventional long-term temporal dependency modeling, multivariate time series (MTS) forecasting has rapidly shifted toward channel dependency (CD) modeling. This shift significantly improves modeling quality by fully leveraging both multivariate relationships and temporal dependencies. Recent methods primarily model channel dependency through correlation learning (e.g., cross-attention) or non-trainable statistical techniques (e.g., cross-correlation). However, these approaches struggle to fully capture the intrinsic relationships within MTS, particularly those stemming from directed cause-effect relations (i.e., causality) and non-stationary variates originating from diverse sources. In addition, causality may arise from signals with different temporal behaviors, such as varying periodicity or discrete event sequences, which has not been sufficiently discussed before. In this paper, we propose CALAS (Causality-enhanced Attention with Learnable and Adaptive Spacing), the first end-to-end learning method for MTS forecasting that uncovers causality among variates without relying on statistical measures or prior knowledge. To model the underlying causality, which consists of causal strength and propagation delay, we design a hypernetwork-based 1D convolution mechanism. Inspired by dilated convolution with learnable spacings (DCLS) and spiking neural networks (SNNs), we extend discrete time delays into a continuous Gaussian kernel. Combining the hypernetwork-generated Gaussian kernel and convolutional weights (i.e., attention or causal strength), we achieve an end-to-end dynamic causality modeling mechanism. This mechanism enhances the model’s ability to capture time-varying causality across multi-source variates, ultimately improving prediction accuracy, quality, and interpretability. For evaluation, we conduct extensive experiments on six real-world datasets and qualitative analyses to demonstrate CALAS’s superiority in capturing varying causality in a data-agnostic manner. The results indicate that CALAS significantly improves MTS forecasting accuracy compared to state-of-the-art methods by dynamically modeling causality among variates.
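The continuous-delay construction has a compact core: a strength, a delay, and a width (which a hypernetwork would emit per variate pair) define a Gaussian kernel over discrete lag positions, which then acts as 1D-convolution weights. The sketch below assumes this DCLS-style reading; shapes and names are illustrative.

```python
# Hedged sketch of a Gaussian-kernel continuous delay for causal 1D convolution.
import torch
import torch.nn.functional as F

def gaussian_delay_kernel(a, mu, sigma, K):
    """a: causal strength, mu: delay in steps (continuous), sigma: width.
    Returns (K,) conv weights concentrated around lag mu."""
    lags = torch.arange(K, dtype=torch.float32)
    w = torch.exp(-0.5 * ((lags - mu) / sigma) ** 2)
    return a * w / w.sum()                    # normalized over lag positions

# e.g. one source->target pair: strength 0.8, delay 3.4 steps, width 1.2
k = gaussian_delay_kernel(torch.tensor(0.8), torch.tensor(3.4),
                          torch.tensor(1.2), K=8)
x = torch.randn(1, 1, 64)                     # (batch, channel, time)
y = F.conv1d(x, k.view(1, 1, -1))             # delayed, strength-scaled signal
```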

378EraseDiff: Erasing Data Influence in Diffusion Models

[openreview] [pdf]

Abstract We introduce EraseDiff, an unlearning algorithm designed for diffusion models to address concerns related to data memorization. Our approach formulates the unlearning task as a constrained optimization problem, aiming to preserve the utility of the diffusion model on retained data while removing the information associated with the data to be forgotten. This is achieved by altering the generative process to deviate away from the ground-truth denoising procedure. To manage the computational complexity inherent in the diffusion process, we develop a first-order method for solving the optimization problem, which has shown empirical benefits. Extensive experiments and thorough comparisons with state-of-the-art algorithms demonstrate that EraseDiff effectively preserves the model’s utility, efficacy, and efficiency.

379Process-Driven Autoformalization in Lean 4

[openreview] [pdf]

Abstract Autoformalization, the conversion of natural language mathematics into formal languages, offers significant potential for advancing mathematical reasoning. However, existing efforts are limited to formal languages with substantial online corpora and struggle to keep pace with rapidly evolving languages like Lean 4. To bridge this gap, we propose a large-scale dataset, Formalization for Lean 4 (FormL4), designed to comprehensively evaluate the autoformalization capabilities of large language models (LLMs), encompassing both statements and proofs in natural and formal languages. Additionally, we introduce the Process-Driven Autoformalization (PDA) framework, which leverages precise feedback from Lean 4 compilers to enhance autoformalization. Extensive experiments demonstrate that PDA improves autoformalization, enabling higher compiler accuracy and human-evaluation scores using less filtered training data. Moreover, when fine-tuned with data containing detailed process information, PDA exhibits enhanced data utilization, resulting in more substantial improvements in autoformalization for Lean 4.

380Bayesian Active Learning By Distribution Disagreement

[openreview] [pdf]

Abstract Active Learning (AL) for regression has been systematically under-researched due to the increased difficulty of measuring uncertainty in regression models. Since normalizing flows offer a full predictive distribution instead of a point forecast, they facilitate direct usage of known heuristics for AL like Entropy or Least-Confident sampling. However, we show that most of these heuristics do not work well for normalizing flows in pool-based AL, and more sophisticated algorithms are needed to distinguish between aleatoric and epistemic uncertainty. In this work, we propose BALSA, an adaptation of the BALD algorithm tailored for regression with normalizing flows. With this work we extend current research on uncertainty quantification with normalizing flows to real-world data and pool-based AL with multiple acquisition functions and query sizes. We report SOTA results for BALSA across 4 different datasets and 2 different architectures.
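A BALD-style score for density models can be estimated purely from samples, since flows expose log-probabilities but no closed-form entropy. The sketch below assumes a hypothetical `sample`/`log_prob` interface over an ensemble of conditional flows; BALSA's actual acquisition may differ.

```python
# Hedged sketch: sample-based mutual-information (BALD-style) acquisition.
import math
import torch

def bald_score(flows, x, n_samples=64):
    """flows: M conditional models, each with .sample(x, n) -> (n,) and
    .log_prob(y, x) -> (n,). Higher score = more epistemic disagreement."""
    M = len(flows)
    ys = torch.cat([f.sample(x, n_samples) for f in flows])   # mixture samples
    logs = torch.stack([f.log_prob(ys, x) for f in flows])    # (M, M*n)
    log_mix = torch.logsumexp(logs, dim=0) - math.log(M)
    H_mix = -log_mix.mean()                                   # mixture entropy
    H_members = torch.stack(
        [-f.log_prob(f.sample(x, n_samples), x).mean() for f in flows]).mean()
    return (H_mix - H_members).item()          # approx. I(y; model | x)
```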

381LoRA-Composer: Leveraging Low-Rank Adaptation for Multi-Concept Customization in Training-Free Diffusion Models

[openreview] [pdf]

Abstract Customization generation techniques have significantly advanced the synthesis of specific concepts across varied contexts. Multi-concept customization emerges as a particularly challenging task within this domain. Existing approaches often rely on training a fusion matrix of multiple Low-Rank Adaptations (LoRAs) to merge various concepts into a single image. However, we identify that this straightforward method faces two major challenges: 1) concept confusion, where the model struggles to preserve distinct individual characteristics, and 2) concept vanishing, where the model fails to generate the intended subjects. To address these issues, we introduce LoRA-Composer, a training-free framework designed for seamlessly integrating multiple LoRAs, thereby enhancing the harmony among different concepts within generated images. LoRA-Composer addresses concept vanishing through concept injection constraints, enhancing concept visibility via an expanded cross-attention mechanism. To combat concept confusion, concept isolation constraints are introduced, refining the self-attention computation. Furthermore, latent re-initialization is proposed to effectively stimulate concept-specific latents within designated regions. Our extensive testing showcases a notable enhancement in LoRA-Composer’s performance compared to standard baselines, especially when eliminating image-based conditions such as canny edges or pose estimation.

382ϕ-Update: A Class of Policy Update Methods with Policy Convergence Guarantee

[openreview] [pdf]

Abstract Inspired by the similar update pattern of softmax natural policy gradient and Hadamard policy gradient, we propose to study a general policy update rule called ϕ-update, where ϕ refers to a scaling function applied to advantage functions. Under very mild conditions on ϕ, the global asymptotic convergence of state values under ϕ-update is first established. We then show that the policy produced by ϕ-update indeed converges, even when there are multiple optimal policies. This is in stark contrast to existing results, where explicit regularizations are required to guarantee the convergence of the policy. Since softmax natural policy gradient is an instance of ϕ-update, this provides an affirmative answer to the question of whether the policy produced by softmax natural policy gradient converges. The exact asymptotic convergence rate of state values is further established based on the policy convergence. Lastly, we establish the global linear convergence of ϕ-update.
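In the tabular case the rule is one line: reweight the current policy by ϕ of the advantage and renormalize. Below is a hedged sketch, with ϕ(x) = exp(η x) recovering the softmax natural policy gradient instance mentioned above; the admissible conditions on ϕ are in the paper.

```python
# Tabular sketch of the ϕ-update rule.
import numpy as np

def phi_update(pi, advantage, phi):
    """pi: (S, A) current policy; advantage: (S, A); phi: elementwise callable."""
    new_pi = pi * phi(advantage)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

pi = np.full((3, 2), 0.5)                                # uniform start
adv = np.array([[0.2, -0.2], [1.0, -1.0], [0.0, 0.0]])
pi = phi_update(pi, adv, phi=lambda a: np.exp(0.5 * a))  # softmax-NPG instance
```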

383Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better

[openreview] [pdf]

Abstract Diffusion Models (DM) and Consistency Models (CM) are two types of popular generative models with good generation quality on various tasks. When training DM and CM, intermediate weight checkpoints are not fully utilized and only the last converged checkpoint is used. In this work, we find that proper checkpoint merging can significantly improve training convergence and final performance. Specifically, we propose LCSC, a simple but effective and efficient method to enhance the performance of DM and CM by combining checkpoints along the training trajectory with coefficients deduced from evolutionary search. We demonstrate the value of LCSC through two use cases: (a) Reducing training cost. With LCSC, we only need to train DM/CM with fewer iterations and/or lower batch sizes to obtain comparable sample quality with the fully trained model. For example, LCSC achieves considerable training speedups for CM (23× on CIFAR-10 and 15× on ImageNet-64). (b) Enhancing pre-trained models. When full training is already done, LCSC can further improve the generation quality or efficiency of the final converged models. For example, LCSC achieves better FID using a single network function evaluation (NFE) than the base model with 2 NFEs on consistency distillation, and decreases the NFE of DM from 15 to 9 while maintaining the generation quality. Applying LCSC to large text-to-image models, we also observe clearly enhanced generation quality.
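Both ingredients, the linear combination and the coefficient search, fit in a short sketch. The (1+1)-style evolutionary loop below is a generic stand-in for the paper's search scheme, and `score_fn` abstracts the quality metric (e.g., FID on a small sample set); float parameter tensors are assumed.

```python
# Hedged sketch of LCSC: combine saved checkpoints, search the coefficients.
import copy
import torch

def combine_checkpoints(ckpts, coeffs):
    """ckpts: list of state_dicts along the trajectory; coeffs: list of floats."""
    out = copy.deepcopy(ckpts[0])
    for key in out:
        out[key] = sum(c * ck[key] for c, ck in zip(coeffs, ckpts))
    return out

def evolve_coeffs(ckpts, score_fn, iters=100, sigma=0.05):
    coeffs = torch.full((len(ckpts),), 1.0 / len(ckpts))
    best = score_fn(combine_checkpoints(ckpts, coeffs.tolist()))
    for _ in range(iters):
        cand = coeffs + sigma * torch.randn_like(coeffs)    # mutate
        score = score_fn(combine_checkpoints(ckpts, cand.tolist()))
        if score < best:                                    # lower = better (FID)
            best, coeffs = score, cand
    return coeffs, best
```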

384Long-tailed Adversarial Training with Self-Distillation

[openreview] [pdf]

Abstract Adversarial training significantly enhances adversarial robustness, yet superior performance is predominantly achieved on balanced datasets. Addressing adversarial robustness in the context of unbalanced or long-tailed distributions is considerably more challenging, mainly due to the scarcity of tail data instances. Previous research on adversarial robustness within long-tailed distributions has primarily focused on combining traditional long-tailed natural training with existing adversarial robustness methods. In this study, we provide an in-depth analysis of why adversarial training struggles to achieve high performance on tail classes in long-tailed distributions. Furthermore, we propose a simple yet effective solution to advance adversarial robustness on long-tailed distributions through a novel self-distillation technique. Specifically, this approach leverages a balanced self-teacher model, which is trained using a balanced dataset sampled from the original long-tailed dataset. Our extensive experiments demonstrate state-of-the-art performance in both clean and robust accuracy for long-tailed adversarial robustness, with significant improvements in tail class performance on various datasets. We improve the accuracy against PGD attacks for tail classes by 20.3, 7.1, and 3.8 percentage points on CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively, while achieving the highest robust accuracy.
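The balanced self-teacher can be sketched in two parts: a class-balanced index sampler and a distillation term added to the student's adversarial training loss. Sampler details and loss weights below are assumptions.

```python
# Hedged sketch of the balanced-subset sampling behind the self-teacher.
from collections import defaultdict
import numpy as np

def balanced_subset(labels, per_class, seed=0):
    """Return indices containing at most `per_class` examples of each class."""
    buckets = defaultdict(list)
    for i, y in enumerate(labels):
        buckets[y].append(i)
    rng = np.random.default_rng(seed)
    idx = [i for ys in buckets.values()
           for i in rng.choice(ys, min(per_class, len(ys)), replace=False)]
    return np.array(sorted(idx))

# Teacher: adversarially train a model on data[balanced_subset(labels, k)].
# Student: robust CE on the full long-tailed data
#          + lambda * KL(student(x_adv) || teacher(x_adv)).
```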

385Guided-BFNs: Towards Visualizing and Understanding Bayesian Flow Networks in the Context of Trajectory Planning

[openreview] [pdf]

Abstract Bayesian Flow Networks (BFNs) represent an emerging class of generative models that exhibit promising capabilities in modeling continuous, discretized, and discrete data. In this paper, we develop Guided-BFNs to integrate BFNs with conditional guidance and gradient guidance to facilitate the effective application of such models in trajectory planning tasks. Based on our developments, we can better comprehend BFNs by inspecting the generation dynamics of the planning trajectories. Through extensive parameter tuning and rigorous ablation experiments, we systematically delineate the functional roles of various parameters and elucidate the pivotal components within the structure of BFNs. Furthermore, we conduct a comparative analysis of the planning results between diffusion models and BFNs, to discern their similarities and differences. Additionally, we undertake efforts to augment the performance of BFNs, including developing a faster and training-free sampling algorithm for sample generation. Our objectives encompass not only a comprehensive exploration of BFNs’ structural insights but also the enhancement of their practical utility.

386DET: Learn to Solve the Tunnel Traveling Salesmen Problem using Double-Encoder Transformer

[openreview] [pdf]

Abstract We delve into a challenging variant of the Traveling Salesman Problem (TSP), namely tunnel TSP, which incorporates a new important constraint requiring the traversal of a prescribed set of tunnels. While traditional deep reinforcement learning (DRL) based neural TSP algorithms excel in optimizing routes without tunnel restrictions, they often struggle to achieve optimal performance in tunnel TSP due to the neglect of the crucial role of tunnel attributes during solution generation. To address this challenge, we propose a simple but effective and flexible technique, called Double-Encoder Transformer (DET), which can be seamlessly integrated into various existing autoregressive neural TSP solvers. DET processes node and tunnel location information separately and encodes them in two distinct feature spaces. Following an efficient fusion strategy, DET then integrates the encoded information from nodes and tunnels, harnessing their intricate interactions. Experimental validation demonstrates that integrating DET into existing autoregressive neural solvers significantly improves performance, enabling us to reduce the average optimality gap for tunnel TSP from 12.58% (of the previous Single-Encoder model) to 7.35%.

387Generating Model Parameters for Controlling: Parameter Diffusion for Controllable Multi-Task Recommendation

[openreview] [pdf]

Abstract Commercial recommender systems face the challenge that task requirements from platforms or users often change dynamically (e.g., varying preferences for accuracy or diversity). Ideally, the model should be re-trained after resetting a new objective function, adapting to these changes in task requirements. However, in practice, the high computational costs associated with retraining make this process impractical for models already deployed to online environments. This raises a new challenging problem: how to efficiently adapt the learning model to different task requirements by controlling model parameters after deployment, without the need for retraining. To address this issue, we propose a novel controllable learning approach via Parameter Diffusion for controllable multi-task Recommendation (PaDiRec), which allows the customization and adaptation of recommendation model parameters to new task requirements without retraining. Specifically, we first obtain the optimized model parameters through adapter tuning based on the feasible task requirements. Then, we utilize the diffusion model as a parameter generator, employing classifier-free guidance in conditional training to learn the distribution of optimized model parameters under various task requirements. Finally, the diffusion model is applied to effectively generate model parameters in a test-time adaptation manner given task requirements. As a model-agnostic approach, PaDiRec can leverage existing recommendation models as backbones to enhance their controllability. Extensive experiments on public datasets and a dataset from a commercial app indicate that PaDiRec can effectively enhance controllability through efficient model parameter generation. The code is released at https://anonymous.4open.science/r/PaDiRec-DD13e.

388Characterizing Context Influence and Hallucination in Summarization

[openreview] [pdf]

Abstract Although Large Language Models (LLMs) have achieved remarkable performance in numerous downstream tasks, their ubiquity has raised two significant concerns. One is that LLMs can hallucinate by generating content that contradicts relevant contextual information; the other is that LLMs can inadvertently leak private information due to input regurgitation. Many prior works have extensively studied each concern independently, but none have investigated them simultaneously. Furthermore, auditing the influence of provided context during open-ended generation with a privacy emphasis is understudied. To this end, we comprehensively characterize the influence and hallucination of contextual information during summarization. We introduce a definition for context influence and Context-Influence Decoding (CID), and then we show that amplifying the context (by factoring out prior knowledge) and the context being out of distribution with respect to prior knowledge both increase the context’s influence on an LLM. Moreover, we show that context influence gives a lower bound on the private information leakage of CID. We corroborate our analytical findings with experimental evaluations that show that improving the F1 ROUGE-L score on CNN-DM for LLaMA 3 by 10% over regular decoding also leads to 1.5x more influence by the context. Moreover, we empirically evaluate how context influence and hallucination are affected by (1) model capacity, (2) context size, (3) the length of the current response, and (4) different token $n$-grams of the context.
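The step of amplifying the context by factoring out prior knowledge admits a short sketch in the style of contrastive decoding: score next tokens with and without the context and subtract a scaled prior. An HF-style `.logits` interface is assumed, and the exact CID formula may differ.

```python
# Hedged sketch of context amplification for next-token scoring.
import torch

@torch.no_grad()
def cid_next_logits(model, ids_with_ctx, ids_without_ctx, alpha=0.5):
    l_ctx = model(ids_with_ctx).logits[:, -1, :]       # p(y | context, query)
    l_prior = model(ids_without_ctx).logits[:, -1, :]  # p(y | query) prior
    return (1 + alpha) * l_ctx - alpha * l_prior       # factor out the prior
```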

389Which Algorithms Have Tight Generalization Bounds?

[openreview] [pdf]

Abstract We study which machine learning algorithms have tight generalization bounds. First, we present conditions that preclude the existence of tight generalization bounds. Specifically, we show that algorithms that have certain inductive biases that cause them to be unstable do not admit tight generalization bounds. Next, we show that algorithms that are sufficiently stable do have tight generalization bounds. We conclude with a simple characterization that relates the existence of tight generalization bounds to the conditional variance of the algorithm’s loss.

390Frequency-Decoupled Cross-Modal Knowledge Distillation

[openreview] [pdf]

Abstract Knowledge distillation (KD) has proven highly effective for compressing large models and enhancing the performance of smaller ones. However, its effectiveness diminishes in cross-modal scenarios, such as vision-to-language distillation, where inconsistencies in representation across modalities make knowledge transfer difficult. To address this challenge, we propose frequency-decoupled cross-modal knowledge distillation, a method designed to decouple and balance knowledge transfer across modalities by leveraging frequency-domain features. We observe that low-frequency features tend to capture modality-agnostic, generalizable information, while high-frequency features are more modality-specific. Accordingly, we apply distinct losses to these features: enforcing strong alignment in the low-frequency domain and introducing relaxed alignment for high-frequency features. Additionally, we propose a scale consistency loss to address distributional shifts between modalities, and employ a shared classifier to unify feature spaces. Extensive experiments across multiple benchmark datasets show that our method substantially outperforms traditional KD and state-of-the-art cross-modal KD approaches.
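The frequency split itself is a few lines of FFT masking, after which the low band gets a strong alignment loss and the high band a relaxed one. The cutoff and weights below are assumptions.

```python
# Hedged sketch of frequency-decoupled feature alignment.
import torch

def freq_split(feat, cutoff=0.25):
    """feat: (B, C, H, W). Returns (low, high) parts via a centered FFT mask."""
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    B, C, H, W = feat.shape
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    mask = (((yy - H // 2).abs() < H * cutoff) &
            ((xx - W // 2).abs() < W * cutoff)).float()
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real
    return low, feat - low

def fd_kd_loss(student_f, teacher_f, w_low=1.0, w_high=0.1):
    s_lo, s_hi = freq_split(student_f)
    t_lo, t_hi = freq_split(teacher_f)
    mse = torch.nn.functional.mse_loss
    return w_low * mse(s_lo, t_lo) + w_high * mse(s_hi, t_hi)  # strong / relaxed
```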

391Learning Transferable Sub-goals by Hypothesizing Generalizing Features

[openreview] [pdf]

Abstract Although transfer is a key promise of hierarchical reinforcement learning, current methods discover nontransferable skills. Typically, skills are defined over all state features simultaneously, preventing generalization as some state features reliably support generalization while others do not. For an agent to effectively transfer a skill it must identify features that generalize and define the skill over this subset. However, this task is under-specified as the agent has no prior knowledge of what future tasks may be introduced. Since successful transfer requires a skill to reliably achieve a sub-goal from different states, we focus our attention on ensuring sub-goals are represented in a transferable way. For each sub-goal, we train an ensemble of classifiers while explicitly incentivizing them to use minimally overlapping features. Each ensemble member represents a unique hypothesis about the transferable features of a sub-goal that the agent can use to learn a skill in previously unseen portions of the environment. Environment reward then determines which hypothesis is most transferable for the given task, based on the intuition that transferable sub-goals lead to better reward maximization. We apply these reusable sub-goals to MiniGrid and Montezuma’s Revenge, allowing us to relearn previously defined skills in unseen parts of the state-space.

392A Super-Aligned Driving Generalist Is Your Cockpit

[openreview] [pdf]

Abstract The intelligent driving cockpit, an important part of intelligent driving, needs to match different users’ comfort, interaction, and safety needs. This paper aims to build a super-aligned and generalist driving agent, Sage Deer. Sage Deer achieves three highlights: (1) Super-aligned: it reacts differently according to different people’s preferences and biases. (2) Generalist: it can understand the user’s physiological indicators, facial emotions, hand movements, body movements, driving scenarios, and behavioral decisions. (3) Multimodal: it can understand RGB, NIR, and depth video to build more robust perception, understanding, and reasoning. To achieve these requirements, we design a retrieval-enhanced multimodal framework. We collected multiple datasets and built a large-scale benchmark that measures Sage Deer’s perceptual decision-making ability and super-alignment accuracy.

393SATCH: Specialized Assistant Teacher Distillation to Reduce Catastrophic Forgetting

[openreview] [pdf]

Abstract Continual learning enables models to learn new tasks sequentially without forgetting previously learned knowledge. Knowledge distillation reduces forgetting by using a single teacher model to transfer previous knowledge to the student model. However, existing methods face challenges, specifically loss of task-specific knowledge, limited diversity in the transferred knowledge, and delays in teacher availability. These issues stem from self-distillation, where the teacher is a mere snapshot of the student after learning a new task, inheriting the student’s biases and becoming available only after learning a task. We propose Specialized Assistant TeaCHer distillation (SATCH), a novel method that uses a smaller assistant teacher trained exclusively on the current task. By incorporating the assistant teacher early in the learning process, SATCH provides task-specific guidance, improves the diversity of transferred knowledge, and preserves critical task-specific insights. Our method integrates seamlessly with existing knowledge distillation techniques, and experiments on three standard continual learning benchmarks show that SATCH improves accuracy by up to 12% when combined with four state-of-the-art methods. Code is available in supplementary materials.

394Scaling Diffusion Language Models via Adaptation from Autoregressive Models

[openreview] [pdf]

Abstract Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling, potentially addressing limitations of autoregressive (AR) models. However, current DLMs have been studied at a smaller scale compared to their AR counterparts and lack fair comparison on language modeling benchmarks. Additionally, training diffusion models from scratch at scale remains challenging. Given the prevalence of open-source AR language models, we propose adapting these models to build text diffusion models. We demonstrate connections between AR and diffusion modeling objectives and introduce a simple continual pre-training approach for training diffusion models. Through systematic evaluation on language modeling, reasoning, and commonsense benchmarks, we show that we can convert AR models ranging from 127M to 7B parameters (GPT2 and LLaMA) into diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training. Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts. We release a suite of DLMs (with 127M, 355M, and 7B parameters) capable of generating fluent text, performing in-context learning, filling in the middle without prompt re-ordering, and following instructions.

395Hindsight Preference Learning for Offline Preference-based Reinforcement Learning

[openreview] [pdf]

Abstract Offline preference-based reinforcement learning (RL), which focuses on optimizing policies using human preferences between pairs of trajectory segments selected from an offline dataset, has emerged as a practical avenue for RL applications. Existing works rely on extracting step-wise reward signals from trajectory-wise preference annotations, assuming that preferences correlate with the cumulative Markovian rewards. However, such methods fail to capture the holistic perspective of data annotation: Humans often assess the desirability of a sequence of actions by considering the overall outcome rather than the immediate rewards. To address this challenge, we propose to model human preferences using rewards conditioned on future outcomes of the trajectory segments, i.e. the hindsight information. For downstream RL optimization, the reward of each step is calculated by marginalizing over possible future outcomes, the distribution of which is approximated by a variational auto-encoder trained using the offline dataset. Our proposed method, Hindsight Preference Learning (HPL), can facilitate credit assignment by taking full advantage of vast trajectory data available in massive unlabeled datasets. Comprehensive empirical studies demonstrate the benefits of HPL in delivering robust and advantageous rewards across various domains.

396Latent Diffusion with LLMs for Reasoning

[openreview] [pdf]

Abstract Despite the widespread adoption of large language models with hundreds of billions of parameters, these models still struggle on complex reasoning benchmarks. In this paper, we argue that the autoregressive nature of current language models is not suited for reasoning due to fundamental limitations, and that reasoning requires slow accumulation of knowledge through time. We show that combining latent diffusion models with an encoder-decoder transformer architecture provides a scalable way to address some of the fundamental shortcomings posed by autoregressive models. Diffusion models can arrive at predictions through many forward passes in latent space, and their reasoning is not handicapped by the order of the tokens in the dataset. Through our experiments, we show that latent diffusion language models are a feasible approach towards scalable language models that have general complex reasoning abilities.

397Adversarial Generative Flow Network for Solving Vehicle Routing Problems

[openreview] [pdf]

Abstract Recent research into solving vehicle routing problems (VRPs) has gained significant traction, particularly through the application of deep (reinforcement) learning for end-to-end solution construction. However, many current construction-based neural solvers predominantly utilize Transformer architectures, which can face scalability challenges and struggle to produce diverse solutions. To address these limitations, we introduce a novel framework beyond Transformer-based approaches, i.e., Adversarial Generative Flow Networks (AGFN). This framework integrates the generative flow network (GFlowNet)—a probabilistic model inherently adept at generating diverse solutions (routes)—with a complementary model for discriminating (or evaluating) the solutions. These models are trained alternately in an adversarial manner to improve the overall solution quality, followed by a proposed hybrid decoding method to construct the solution. We apply the AGFN framework to solve the capacitated vehicle routing problem (CVRP) and travelling salesman problem (TSP), and our experimental results demonstrate that AGFN surpasses the popular construction-based neural solvers, showcasing strong generalization capabilities on synthetic and real-world benchmark instances.

398Diffusion Trajectory-guided Policy: A Novel Framework for Long-Horizon Robot Manipulation

[openreview] [pdf]

Abstract Recently, Vision-Language Models (VLMs) have made substantial progress in robot imitation learning, benefiting from increased amounts of demonstration data. However, the high cost of data collection remains a significant bottleneck, and the scarcity of demonstrations often results in poor generalization of the imitation policy, especially in long-horizon robotic manipulation tasks. To address these challenges, we propose the Diffusion Trajectory-guided Policy (DTP) framework, which generates task-relevant trajectories through a diffusion model to guide policy learning for long-horizon tasks. Furthermore, we demonstrate that our DTP method offers a useful interface for prompt engineering, providing a novel way to connect robot manipulation skills with interactions involving LLMs or humans. Our approach employs a two-stage training process: we first train a generative vision-language model to create diffusion-based task-relevant trajectories, then refine the imitation policy using these trajectories. We validate that the DTP method achieves substantial performance improvements in extensive experiments on the CALVIN simulation benchmark, starting from scratch without any external pretraining. Our approach outperforms state-of-the-art baselines by an average of 25% in success rate across various settings.

399Length Desensitization in Direct Preference Optimization

[openreview] [pdf]

Abstract Direct Preference Optimization (DPO) is widely utilized in the Reinforcement Learning from Human Feedback (RLHF) phase to align Large Language Models (LLMs) with human preferences, thereby enhancing both their harmlessness and efficacy. However, it has been observed that DPO tends to over-optimize for verbosity, which can detrimentally affect both performance and user experience. In this paper, we conduct an in-depth theoretical analysis of DPO’s optimization objective and reveal a strong correlation between its implicit reward and data length. This correlation misguides the optimization direction, resulting in length sensitivity during DPO training and leading to verbosity. To address this issue, we propose a length-desensitization improvement method for DPO, termed LD-DPO. The proposed method aims to desensitize DPO to data length by decoupling explicit length preference, which is relatively insignificant, from the other implicit preferences, thereby enabling more effective learning of the intrinsic preferences. We utilized two settings (Base and Instruct) of Llama2-13B, Llama3-8B, and Qwen2-7B for experimental validation on various benchmarks including MT-Bench and AlpacaEval 2. The experimental results indicate that LD-DPO consistently outperforms DPO and other baseline methods, achieving more concise responses with a 10-40% reduction in length compared to DPO. We conducted in-depth experimental analyses to demonstrate that LD-DPO can indeed achieve length desensitization and align the model more closely with human-like preferences. “Brevity is the Soul of Wit.” (William Shakespeare)
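One plausible reading of the decoupling is to count token log-probs up to the shorter response's length in full and down-weight the excess (the explicitly length-driven part) before forming the usual DPO margin. The sketch below follows that reading and is not the paper's verbatim formulation.

```python
# Hedged sketch of length-desensitized log-likelihoods for a DPO-style loss.
import torch
import torch.nn.functional as F

def length_adjusted_logp(token_logps, ref_len, gamma=0.3):
    """token_logps: (L,) per-token log-probs of one response; tokens beyond
    ref_len contribute only a gamma fraction."""
    return token_logps[:ref_len].sum() + gamma * token_logps[ref_len:].sum()

def ld_dpo_loss(lp_c, lp_r, ref_lp_c, ref_lp_r, beta=0.1, gamma=0.3):
    n = min(len(lp_c), len(lp_r))          # shared-length 'intrinsic' part
    margin = beta * ((length_adjusted_logp(lp_c, n, gamma)
                      - length_adjusted_logp(ref_lp_c, n, gamma))
                     - (length_adjusted_logp(lp_r, n, gamma)
                        - length_adjusted_logp(ref_lp_r, n, gamma)))
    return -F.logsigmoid(margin)
```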

400Understanding Generalization of Preference Optimization Under Noisy Feedback

[openreview] [pdf]

Abstract As large language models (LLMs) advance their capabilities, aligning these models with human preferences has become crucial. Preference optimization, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a crucial component for aligning LLMs. However, most existing works assume noise-free feedback, which is unrealistic given the inherent errors and inconsistencies in human judgments. This paper addresses the impact of noisy feedback on preference optimization, providing generalization guarantees under these conditions. Unlike traditional analyses that assume convergence, our work focuses on finite-step preference optimization, offering new insights that are more aligned with practical LLM training. We establish generalization guarantees for noisy preference learning under a broad family of preference optimization losses such as DPO, IPO, SLiC, etc. Our analysis provides the basis for a general model that closely describes how the generalization decays with the noise rate. Empirical validation on contemporary LLMs confirms the practical relevance of our findings, offering valuable insights for developing AI systems that align with human preferences.

401FDN: Interpretable Spatiotemporal Forecasting with Future Decomposition Networks

[openreview] [pdf]

Abstract Spatiotemporal systems comprise a collection of spatially distributed yet interdependent entities, each generating unique dynamic signals. Highly sophisticated methods proposed in recent years deliver state-of-the-art (SOTA) forecasts, but few have focused on interpretability. To address this, we propose the Future Decomposition Network (FDN), a novel forecast model capable of (a) providing interpretable predictions through classification, (b) revealing latent activity patterns in the target time series, and (c) delivering forecasts competitive with SOTA methods at a fraction of their memory and runtime cost. We conduct comprehensive analyses of FDN on multiple datasets from hydrologic, traffic, and energy systems, demonstrating its improved accuracy and interpretability.

402Reflect-then-Plan: Offline Model-Based Planning through a Doubly Bayesian Lens

[openreview] [pdf]

Abstract Offline reinforcement learning (RL) is essential when online exploration is costly or unsafe, but it often struggles with high epistemic uncertainty due to limited data. Existing methods learn fixed conservative policies, which limits adaptivity and generalization. To tackle these challenges, we propose Reflect-then-Plan (RefPlan), a novel doubly Bayesian approach for offline model-based (MB) planning that enhances offline-learned policies for improved adaptivity and generalization. RefPlan integrates uncertainty modeling and MB planning in a unified probabilistic framework, recasting planning as Bayesian posterior estimation. During deployment, it updates a belief distribution over environment dynamics based on real-time observations. By incorporating this uncertainty into MB planning via marginalization, RefPlan derives plans that account for unknowns beyond the agent’s limited knowledge. Empirical results on standard benchmarks show that RefPlan significantly improves the performance of conservative offline RL policies. In particular, RefPlan maintains robust performance under high epistemic uncertainty and limited data, while demonstrating resilience to changing environment dynamics, improving the flexibility, generalizability, and robustness of offline-learned policies.

403Anomaly Detection through Conditional Diffusion Probability Modeling on Graphs

[openreview] [pdf]

Abstract Existing Graph Neural Network-based anomaly detection methods suffer from over-smoothing issues during feature aggregation. Moreover, most existing methods are discriminative models that learn the boundaries between anomalous and normal data points, allowing malicious nodes in a dynamic adversarial environment to bypass detection boundaries. Existing methods also primarily focus on enhancing the discriminative boundary for each individual node, rather than considering the interdependencies of node anomalies from a holistic graph perspective. We propose an advanced Conditional Graph Anomaly Diffusion Model (CGADM) to model and capture the joint distribution of anomalies on the whole graph, thereby enabling generative graph anomaly detection. To avoid starting the diffusion process from a random state, CGADM introduces a prior-guided denoising diffusion probability model. To circumvent the need for iterative denoising samplings for each node on large-scale graphs, we adopt a prior confidence-aware mechanism to dynamically adjust the reverse sampling steps for each node, significantly reducing the computational burden on large-scale graphs. We conducted experiments on CGADM using standard benchmarks, and the results demonstrated excellent performance in graph anomaly detection tasks. Additional ablation studies confirmed our framework’s computational advantages.

404Targeted Attack Improves Protection against Unauthorized Diffusion Customization

[openreview] [pdf]

Abstract Diffusion models set a new milestone for image generation yet raise public concerns, as they can be fine-tuned on unauthorized images for customization. Protection based on adversarial attacks has emerged to counter this unauthorized diffusion customization, by adding protective watermarks to images and poisoning diffusion models. However, current protection, leveraging untargeted attacks, does not appear to be effective enough. In this paper, we propose a simple yet effective improvement for the protection against unauthorized diffusion customization by introducing targeted attacks. We show that by carefully selecting the target, targeted attacks significantly outperform untargeted attacks in poisoning diffusion models and degrading the customization image quality. Extensive experiments validate the superiority of our method on two mainstream customization methods of diffusion models, compared to existing protections. To explain the surprising success of targeted attacks, we delve into the mechanism of attack-based protections and propose a hypothesis based on our observation, which enhances the comprehension of attack-based protections. To the best of our knowledge, we are the first to both reveal the vulnerability of diffusion models to targeted attacks and leverage targeted attacks to enhance protection against unauthorized diffusion customization.

405Diffusion Minimization and Sheaf Neural Networks for Recommender Systems

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) are well-known for successful applications in recommender systems. Despite recent advances in GNN development, various authors report that in certain cases GNNs suffer from so-called oversmoothing problems. Sheaf Neural Networks (SNNs) are one way to address the issue of oversmoothing. In the present work, we propose a novel approach for training SNNs together with user and item embeddings. In this approach, the parameters of the sheaf are inferred via minimization of the classical BPR loss and sheaf diffusion on graphs, subject to orthogonality and consistency constraints. The performance of the novel technique is evaluated on synthetic test cases and standard benchmarks for recommendations.

406Learning Augmentation Policies from A Model Zoo for Time Series Forecasting

[openreview] [pdf]

Abstract Time series forecasting models typically rely on a fixed-size training set and treat all data uniformly, which may not effectively capture the specific patterns present in more challenging training samples. To address this issue, we introduce AutoTSAug, a learnable data augmentation method based on reinforcement learning. Our approach begins with an empirical analysis to determine which parts of the training data should be augmented. Specifically, we identify the so-called marginal samples by considering the prediction diversity across a set of pretrained forecasting models. Next, we propose using variational masked autoencoders as the augmentation model and applying the REINFORCE algorithm to transform the marginal samples into new data. The goal of this generative model is not only to mimic the distribution of real data but also to reduce the variance of prediction errors across the model zoo. By augmenting the marginal samples with a learnable policy, AutoTSAug substantially improves forecasting performance, advancing the prior art in this field with minimal additional computational cost.
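The marginal-sample selection step reduces to a disagreement statistic over the model zoo. A hedged sketch, with the variance statistic and the selection fraction as assumptions:

```python
# Hedged sketch: flag samples where pretrained forecasters disagree most.
import numpy as np

def select_marginal(zoo_preds, top_frac=0.2):
    """zoo_preds: (M, N, H) forecasts of M zoo models for N samples over
    horizon H. Returns indices of the most 'marginal' samples."""
    diversity = zoo_preds.var(axis=0).mean(axis=-1)   # (N,) cross-model spread
    k = max(1, int(top_frac * diversity.shape[0]))
    return np.argsort(-diversity)[:k]

# These indices feed the masked-VAE augmenter trained with REINFORCE.
```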

407One Training Fits All: Generalized Data Condensation via Mixture-of-Information Bottleneck Guidance

[openreview] [pdf]

Abstract Data condensation (DC) technologies are widely used in buffer-constrained scenarios to reduce the memory demand of training samples and maintain DNN training performance. However, due to the storage constraints of deployment devices and the high energy costs of the condensation procedure, synthetic datasets generated by DC often have inferior performance in terms of training efficiency and scalability, which greatly limits their practical application on various edge devices. This dilemma arises for two reasons: i) existing state-of-the-art (SoTA) data condensation approaches update synthetic datasets by intuitively matching intermediate training outputs (e.g., gradients, features and distributions) between real and synthetic datasets, without improving the representational capability of the useful information they contain; ii) DC lacks sufficient consideration for the heterogeneity of storage constraints among various edge devices, which results in large training overheads (i.e., computation or storage). To tackle the above issues, we propose a novel method named Mixture-of-Information Bottleneck Dataset Condensation (MIBDC), which employs information bottlenecks from synthetic datasets with various Image Per Class (IPC) numbers to improve overall DC generalization and scalability. Specifically, in this paper, the following two phenomena are found: i) the quality of synthetic datasets improves with increased synthetic dataset quantity; ii) the smaller the synthetic dataset, the earlier it reaches its convergence peak. Based on these two findings, this paper proposes that i) large synthetic datasets can guide the better convergence of smaller ones and ii) the information contained in synthetic datasets with different IPC numbers can play a collaborative role in guiding dataset condensation generalization. Comprehensive experimental results on three well-known datasets show that, compared with state-of-the-art dataset condensation methods, MIBDC not only enhances the generalization performance of trained models but also achieves superior scalability.

408HDDI: A Historical Data-Based Diffusion Imputation Method for High-Accuracy Recovery in Multivariate Time Series with High Missing Rate and Long-Term Gap

[openreview] [pdf]

Abstract Multivariate time series data often face the challenge of missing values, which can impact the performance of subsequent tasks. Although some deep learning-based imputation methods perform well, they still struggle with insufficient training data due to high missing rates and long-term missing data. To address these challenges, we propose a Historical Data-based Multivariate Time Series Diffusion Imputation (HDDI) method. Unlike existing deep learning-based imputation methods, we design a historical data supplement module to match and fuse historical data to supplement the training data. Additionally, we propose a diffusion imputation module that utilizes the supplemented training data to achieve high-accuracy imputation even under high missing rates and long-term missing scenarios. We conduct extensive experiments on five public multivariate time series datasets; the results show that our HDDI outperforms baseline methods across all five datasets. In particular, when the data missing rate is 90%, HDDI improves accuracy by 25.15% compared to the best baseline method in the random missing scenario, and by 13.64% in the long-term missing scenario. The code is available at https://github.com/liuyu3880/HDDIproject.

409Scenario-Wise Rec: A Multi-Scenario Recommendation Benchmark

[openreview] [pdf]

Abstract Multi-Scenario Recommendation (MSR), which refers to building a unified model to enhance performance across all recommendation scenarios, has recently gained much attention. However, current research in MSR faces two significant challenges that hinder the field’s development: the absence of uniform procedures for multi-scenario dataset processing, which hinders fair comparisons, and the fact that most models are closed-source, which complicates comparisons with current SOTA models. Consequently, we introduce our benchmark, Scenario-Wise Rec, which comprises six public datasets and twelve benchmark models, along with a training and evaluation pipeline. We have also validated our benchmark using an industrial advertising dataset, further enhancing its real-world reliability. We aim for this benchmark to provide researchers with valuable insights from prior works, enabling the development of novel models based on our benchmark and thereby fostering a collaborative research ecosystem in MSR. Our source code is also available.

410RAPID: Retrieval Augmented Training of Differentially Private Diffusion Models

[openreview] [pdf]

Abstract Differentially private diffusion models (DPDMs) harness the remarkable generative capabilities of diffusion models while enforcing differential privacy (DP) for sensitive data. However, existing DPDM training approaches often suffer from significant utility loss, large memory footprint, and expensive inference cost, impeding their practical uses. To overcome such limitations, we present RAPID: Retrieval Augmented PrIvate Diffusion model, a novel approach that integrates retrieval augmented generation (RAG) into DPDM training. Specifically, RAPID leverages available public data to build a knowledge base of sample trajectories; when training the diffusion model on private data, RAPID computes the early sampling steps as queries, retrieves similar trajectories from the knowledge base as surrogates, and focuses on training the later sampling steps in a differentially private manner. Extensive evaluation using benchmark datasets and models demonstrates that, with the same privacy guarantee, RAPID significantly outperforms state-of-the-art approaches by large margins in generative quality, memory footprint, and inference cost, suggesting that retrieval-augmented DP training represents a promising direction for developing future privacy-preserving generative models (code and data are available in the submitted supplemental materials).

411Prototype-based Optimal Transport for Out-of-Distribution Detection

[openreview] [pdf]

Abstract Detecting Out-of-Distribution (OOD) inputs is crucial for improving the reliability of deep neural networks in the real-world deployment. In this paper, inspired by the inherent distribution shift between ID and OOD data, we propose a novel method that leverages optimal transport to measure the distribution discrepancy between test inputs and ID prototypes. The resulting transport costs are used to quantify the individual contribution of each test input to the overall discrepancy, serving as a desirable measure for OOD detection. To address the issue that solely relying on the transport costs to ID prototypes is inadequate for identifying OOD inputs closer to ID data, we generate virtual outliers to approximate the OOD region via linear extrapolation. By combining the transport costs to ID prototypes with the costs to virtual outliers, the detection of OOD data near ID data is emphasized, thereby enhancing the distinction between ID and OOD inputs. Experiments demonstrate the superiority of our method over state-of-the-art methods.
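
A self-contained sketch of the scoring idea under stated assumptions: entropic OT is solved with plain Sinkhorn iterations, virtual outliers are built by linearly extrapolating prototypes away from their mean, and the cost normalization is our choice rather than the paper's:

```python
import numpy as np

def sinkhorn_plan(M, reg=0.1, iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations."""
    a = np.full(M.shape[0], 1.0 / M.shape[0])
    b = np.full(M.shape[1], 1.0 / M.shape[1])
    K = np.exp(-M / reg)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def ood_scores(test_feats, prototypes, alpha=2.0):
    """Per-sample transport-cost contribution to ID prototypes, contrasted
    with the cost to virtual outliers obtained by linear extrapolation."""
    mean = prototypes.mean(axis=0, keepdims=True)
    outliers = prototypes + alpha * (prototypes - mean)   # extrapolate away from ID
    def per_sample_cost(targets):
        M = ((test_feats[:, None, :] - targets[None, :, :]) ** 2).sum(-1)
        P = sinkhorn_plan(M)
        return (P * M).sum(axis=1) * M.shape[0]           # each row's cost share
    # Higher score = costlier to reach ID than the outlier region = more OOD
    return per_sample_cost(prototypes) - per_sample_cost(outliers)
```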

412Pullback Flow Matching on Data Manifolds

[openreview] [pdf]

Abstract We propose Pullback Flow Matching (PFM), a novel framework for generative modeling on data manifolds. Unlike existing methods that assume or learn restrictive closed-form manifold mappings for training Riemannian Flow Matching (RFM) models, PFM leverages pullback geometry and isometric learning to preserve the underlying manifold’s geometry while enabling efficient generation and precise interpolation in latent space. This approach not only facilitates closed-form mappings on the data manifold but also allows for designable latent spaces, using assumed metrics on both data and latent manifolds. By enhancing isometric learning through Neural ODEs and proposing a scalable training objective, we achieve a latent space more suitable for interpolation, leading to improved manifold learning and generative performance. We demonstrate PFM’s effectiveness through applications in synthetic data, protein dynamics and protein sequence data, generating novel proteins with specific properties. This method shows strong potential for drug discovery and materials science, where generating novel samples with specific properties is of great interest.

413Alignment without Over-optimization: Training-Free Solution for Diffusion Models

[openreview] [pdf]

Abstract Diffusion models excel in generative tasks, but aligning them with specific objectives while maintaining their versatility remains challenging. Existing fine-tuning methods often suffer from reward over-optimization, while approximate guidance approaches fail to effectively optimize target rewards. Addressing these limitations, we propose a training-free sampling method based on Sequential Monte Carlo (SMC) to sample from the reward-aligned target distribution. Our approach, tailored for diffusion sampling and incorporating tempering techniques, achieves comparable or superior target rewards to fine-tuning methods while preserving diversity and cross-reward generalization. We demonstrate its effectiveness in single-reward optimization, multi-objective scenarios, and online black-box optimization. This work offers a robust solution for aligning diffusion models with diverse downstream objectives without compromising their general capabilities.
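
A generic sketch of tempered SMC over diffusion particles, not the paper's exact sampler; `denoise_step`, `predict_x0`, and the reward interface are assumed placeholders:

```python
import numpy as np

def smc_sample(denoise_step, predict_x0, reward, x_T, lambdas, rng=None):
    """Training-free reward alignment via tempered SMC (a sketch). Particles
    are reweighted by a tempered reward on the current clean-image estimate
    and resampled at each denoising step.

    denoise_step(x, t) -> x at step t-1 ; predict_x0(x, t) -> clean estimate
    lambdas : per-step tempering coefficients (e.g., increasing toward t = 0)
    """
    rng = rng or np.random.default_rng()
    x, n = x_T, x_T.shape[0]
    for t, lam in zip(range(len(lambdas), 0, -1), lambdas):
        x = denoise_step(x, t)
        logw = lam * reward(predict_x0(x, t))
        w = np.exp(logw - logw.max()); w /= w.sum()
        # Systematic resampling concentrates particles on high-reward regions
        u = (np.arange(n) + rng.random()) / n
        x = x[np.minimum(np.searchsorted(np.cumsum(w), u), n - 1)]
    return x
```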

414AN INFORMATION THEORETIC EVALUATION METRIC FOR STRONG UNLEARNING

[openreview] [pdf]

Abstract Machine unlearning (MU) aims to remove the influence of specific data from trained models, addressing privacy concerns and ensuring compliance with regulations such as the “right to be forgotten.” Evaluating strong unlearning, where the unlearned model is indistinguishable from one retrained without the forgetting data, remains a significant challenge in deep neural networks (DNNs). Common black-box metrics, such as variants of membership inference attacks and accuracy comparisons, primarily assess model outputs but often fail to capture residual information in intermediate layers. To bridge this gap, we introduce the Information Difference Index (IDI), a novel white-box metric inspired by information theory. IDI quantifies retained information in intermediate features by measuring mutual information between those features and the labels to be forgotten, offering a more comprehensive assessment of unlearning efficacy. Our experiments demonstrate that IDI effectively measures the degree of unlearning across various datasets and architectures, providing a reliable tool for evaluating strong unlearning in DNNs.
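
The paper's IDI rests on mutual information between intermediate features and the forget labels; a common variational proxy estimates I(Z; Y) as H(Y) minus the held-out cross-entropy of a probe classifier. A hedged sketch of that proxy follows; the linear probe and the final differencing against a retrained model are our assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_mi(features, labels):
    """Variational proxy for I(Z; Y) in nats: H(Y) minus the held-out
    cross-entropy of a linear probe predicting forget labels from features.
    Assumes integer class labels."""
    Xtr, Xte, ytr, yte = train_test_split(features, labels, test_size=0.3,
                                          random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    proba = probe.predict_proba(Xte)
    col = np.searchsorted(probe.classes_, yte)       # map labels to proba columns
    ce = -np.mean(np.log(proba[np.arange(len(yte)), col] + 1e-12))
    freq = np.bincount(ytr) / len(ytr)
    h_y = -np.sum(freq[freq > 0] * np.log(freq[freq > 0]))
    return max(h_y - ce, 0.0)

# An IDI-flavoured comparison would contrast the information a feature layer
# of the unlearned model retains about the forget labels with a retrained one:
# idi = probe_mi(unlearned_feats, y_forget) - probe_mi(retrained_feats, y_forget)
```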

415A Contrastive Teacher-Student Framework for Novelty Detection under Style Shifts

[openreview] [pdf]

Abstract There have been several efforts to improve Novelty Detection (ND) performance. However, ND methods often suffer significant performance drops under minor distribution shifts caused by changes in the environment, known as style shifts. This challenge arises from the ND setup, where the absence of out-of-distribution (OOD) samples during training causes the detector to be biased toward the dominant style features in the in-distribution (ID) data. As a result, the model mistakenly learns to correlate style with core features, using this shortcut for detection. Robust ND is crucial for real-world applications like autonomous driving and medical imaging, where test samples may have different styles than the training data. Motivated by this, we propose a robust ND method that crafts an auxiliary OOD set with style features similar to the ID set but with different core features. Then, a task-based knowledge distillation strategy is utilized to distinguish core features from style features and help our model rely on core features for discriminating crafted OOD and ID sets. We verified the effectiveness of our method through extensive experimental evaluations on several datasets, including synthetic and real-world benchmarks, against nine different ND methods.

416T-Graphormer: Using Transformers for Spatiotemporal Forecasting

[openreview] [pdf]

Abstract Time series data is ubiquitous and appears in all fields of study. In multivariate time series, observations are interconnected both temporally and across components. For instance, in traffic flow analysis, traffic speeds at different intersections exhibit complex spatiotemporal correlations. This dual structure presents unique challenges for modelling. Most existing forecasting methods address this by learning the spatial and temporal dependencies separately. Here, we propose Temporal Graphormer (T-Graphormer), a transformer-based method that models spatiotemporal correlations directly. By extending the Graphormer architecture over time, each node is updated based on all other nodes within the historical context window, allowing the model to learn powerful representations. We demonstrate the efficacy of T-Graphormer by evaluating it on two real-world traffic prediction benchmarking datasets. Compared to state-of-the-art methods, our method shows a reduction in root mean squared error (RMSE) by up to 10% and mean absolute percentage error (MAPE) by up to 10%.

417Hydra-MDP++: Advancing End-to-End Driving via Hydra-Distillation with Expert-Guided Decision Analysis

[openreview] [pdf]

Abstract We introduce Hydra-MDP++, a novel end-to-end autonomous driving framework that integrates rule-based and neural planners by learning from human demonstrations and distilling knowledge from rule-based experts. We propose a teacher-student knowledge distillation framework with a multi-head student decoder that integrates feedback from rule-based expert teachers. The student model achieves state-of-the-art performance on the NAVSIM benchmark with a tiny image encoder. Moreover, to address limitations in existing evaluation metrics, we expand the teacher model to include traffic light compliance, lane-keeping ability, and extended comfort. This is intended to ensure a more robust decision synthesis in driving. Hydra-MDP++ demonstrates robust and efficient performance across diverse driving scenarios, achieving a 91.0% drive score on NAVSIM by simply scaling the image encoder. Our work contributes to developing more reliable and adaptable autonomous driving systems that combine the strengths of rule-based and neural planning approaches.

418GRADIENT-OPTIMIZED CONTRASTIVE LEARNING

[openreview] [pdf]

Abstract Contrastive learning is a crucial technique in representation learning, producing robust embeddings by distinguishing between similar and dissimilar pairs. In this paper, we introduce a novel framework, Gradient-Optimized Contrastive Learning (GOAL), which enhances network training by optimizing gradient updates during backpropagation as a bilevel optimization problem. Our approach offers three key insights that set it apart from existing methods: (1) Contrastive learning can be seen as an approximation of a one-class support vector machine (OC-SVM) using multiple neural tangent kernels (NTKs) in the network’s parameter space; (2) Hard triplet samples are vital for defining support vectors and outliers in OC-SVMs within NTK spaces, with their difficulty measured using Lagrangian multipliers; (3) Contrastive losses like InfoNCE provide efficient yet dense approximations of sparse Lagrangian multipliers by implicitly leveraging gradients. To address the computational complexity of GOAL, we propose a novel contrastive loss function, Sparse InfoNCE (SINCE), which improves the Lagrangian multiplier approximation by incorporating hard triplet sampling into InfoNCE. Our experimental results demonstrate the effectiveness and efficiency of SINCE in tasks such as image classification and point cloud completion. Demo code is attached in the supplementary file.

419G-Transformer for Conditional Average Potential Outcome Estimation over Time

[openreview] [pdf]

Abstract Estimating potential outcomes for treatments over time based on observational data is important for personalized decision-making in medicine. Yet, existing neural methods for this task either (1) do not perform proper adjustments for time-varying confounders, or (2) suffer from large estimation variance. In order to address both limitations, we introduce the G-transformer (GT). Our GT is a novel, neural end-to-end model which adjusts for time-varying confounders, and provides low-variance estimation of conditional average potential outcomes (CAPOs) over time. Specifically, our GT is the first neural model to perform regression-based iterative G-computation for CAPOs in the time-varying setting. We evaluate the effectiveness of our GT across various experiments. In sum, this work represents a significant step towards personalized decision-making from electronic health records.

420Looking Beyond the Top-1: Transformers Determine Top Tokens in Order

[openreview] [pdf]

Abstract Understanding the inner workings of Transformers is crucial for achieving more accurate and efficient predictions. In this work, we analyze the computation performed by Transformers in the layers after the top-1 prediction has become fixed, which has been previously referred to as the “saturation event”. We expand the concept of saturation events for top-k tokens, demonstrating that similar saturation events occur across language, vision, and speech models. We find that these saturation events happen in order of the corresponding tokens’ ranking, i.e., the model first decides on the top ranking token, then the second highest ranking token, and so on. This phenomenon seems intrinsic to the Transformer architecture, occurring across different architectural variants (decoder-only, encoder-only, and to a lesser extent full-Transformer), and even in untrained Transformers. We propose an underlying mechanism of task transition for this sequential saturation, where task k corresponds to predicting the k-th most probable token, and the saturation events are in fact discrete transitions between the tasks. In support of this we show that it is possible to predict the current task from hidden layer embedding. Furthermore, using an intervention method we demonstrate that we can cause the model to switch from one task to the next. Finally, leveraging our findings, we introduce a novel token-level early-exit strategy, which surpasses existing methods in balancing performance and efficiency.
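
A rough sketch of how one might detect per-rank saturation layers with a logit-lens readout; decoding early layers through the unembedding matrix and the backwards walk are our assumptions, not the paper's exact protocol:

```python
import torch

@torch.no_grad()
def saturation_layers(hidden_states, unembed, k=3):
    """For one token position, estimate the layer at which each of the final
    top-k tokens 'saturates' (stops changing) under a logit-lens readout.

    hidden_states: list of (d,) tensors, one per layer (e.g. from a HF model
                   called with output_hidden_states=True)
    unembed:       (vocab, d) output embedding matrix used to decode layers
    """
    ranked = [torch.topk(h @ unembed.T, k).indices for h in hidden_states]
    final, sat = ranked[-1], []
    for r in range(k):
        layer = len(ranked) - 1
        # Walk backwards while the rank-r prediction already matches the final one
        while layer > 0 and ranked[layer - 1][r] == final[r]:
            layer -= 1
        sat.append(layer)
    return sat  # the paper's finding predicts sat[0] <= sat[1] <= ... <= sat[k-1]
```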

421Repulsive Latent Score Distillation for Solving Inverse Problems

[openreview] [pdf]

Abstract Score Distillation Sampling (SDS) has been pivotal for leveraging pre-trained diffusion models in downstream tasks such as inverse problems, but it faces two major challenges: (i) mode collapse and (ii) latent space inversion, which become more pronounced in high-dimensional data. To address mode collapse, we introduce a novel variational framework for posterior sampling. Utilizing the Wasserstein gradient flow interpretation of SDS, we propose a multimodal variational approximation with a repulsion mechanism that promotes diversity among particles by penalizing pairwise kernel-based similarity. This repulsion acts as a simple regularizer, encouraging a more diverse set of solutions. To mitigate latent space ambiguity, we extend this framework with an augmented variational distribution that disentangles the latent and data spaces. This repulsive augmented formulation balances computational efficiency, quality, and diversity. Extensive experiments on linear and nonlinear inverse tasks with high-resolution images (512×512) using pre-trained Stable Diffusion models demonstrate the effectiveness of our approach.
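
A small sketch of the kernel-based repulsion term under stated assumptions: an RBF kernel with a median-heuristic bandwidth, and a hypothetical weighting against the SDS gradient:

```python
import math
import torch

def similarity_grad(particles, bandwidth=None):
    """Gradient (w.r.t. each particle) of the summed pairwise RBF similarity;
    subtracting it in the update pushes particles apart (the repulsion term)."""
    x = particles.flatten(1)                     # (n, d)
    sq = torch.cdist(x, x) ** 2
    if bandwidth is None:                        # median heuristic (sketch-level)
        bandwidth = sq.median() / max(math.log(x.shape[0]), 1.0)
    k = torch.exp(-sq / (2 * bandwidth))         # (n, n) kernel matrix
    diff = x.unsqueeze(1) - x.unsqueeze(0)       # (n, n, d): x_i - x_j
    grad = -(k.unsqueeze(-1) * diff).sum(1) / bandwidth
    return grad.view_as(particles)

# Hypothetical particle update combining score distillation with repulsion:
# particles = particles - lr * (sds_grad + gamma * similarity_grad(particles))
```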

422Overcoming Lookback Window Limitations: Exploring Longer Windows in Long-Term Time Series Forecasting

[openreview] [pdf]

Abstract Long-term time series forecasting (LTSF) aims to predict future trends based on historical data. While longer lookback windows theoretically provide more comprehensive insights, current Transformer-based models face the Lookback Window Limitation (LWL). On one hand, longer windows introduce redundant information, which can hinder model learning. On the other hand, Transformers tend to overfit temporal noise rather than extract meaningful temporal information when dealing with longer sequences, compounded by their quadratic complexity. In this paper, we aim to overcome LWL, enabling models to leverage more historical information for improved performance. Specifically, to mitigate information redundancy, we introduce the Information Bottleneck Filter (IBF), which applies information bottleneck theory to extract essential subsequences from the input. Additionally, to address the limitations of the Transformer architecture in handling long sequences, we propose the Hybrid-Transformer-Mamba (HTM), which combines the linear complexity and long-range modeling capabilities of Mamba with the Transformer’s strength in modeling short sequences. We integrate these two model-agnostic modules into various existing methods and conduct experiments on seven datasets. The results demonstrate that incorporating these modules effectively overcomes the lookback window limitations. Notably, by combining them with the Patch strategy, we design PIH (Patch-IBF-HTM), successfully extending the window length to 1024 (a significantly larger window than previously achieved) and achieving state-of-the-art results, highlighting the potential of exploring even longer windows.

423Does learning the right latent variables necessarily improve in-context learning?

[openreview] [pdf]

Abstract Large autoregressive models like Transformers can solve tasks through in-context learning (ICL) without learning new weights, suggesting avenues for efficiently solving new tasks. For many tasks, e.g., linear regression, the data factorizes: examples are independent given a task latent that generates the data, e.g., linear coefficients. While an optimal predictor leverages this factorization by inferring task latents, it is unclear if Transformers implicitly do so or if they instead exploit heuristics and statistical shortcuts enabled by attention layers. Both scenarios have inspired active ongoing work. In this paper, we systematically investigate the effect of explicitly inferring task latents. We minimally modify the Transformer architecture with a bottleneck designed to prevent shortcuts in favor of more structured solutions, and then compare performance against standard Transformers across various ICL tasks. Contrary to intuition and some recent works, we find little discernible difference between the two; biasing towards task-relevant latent variables does not lead to better out-of-distribution performance, in general. Curiously, we find that while the bottleneck effectively learns to extract latent task variables from context, downstream processing struggles to utilize them for robust prediction. Our study highlights the intrinsic limitations of Transformers in achieving structured ICL solutions that generalize, and shows that while inferring the right latents aids interpretability, it is not sufficient to alleviate this problem.

424Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model

[openreview] [pdf]

Abstract ControlNets are widely used for adding spatial control to text-to-image diffusion models. However, when it comes to controllable video generation, ControlNets cannot be directly integrated into new backbones due to feature space mismatches, and training ControlNets for new backbones can be a significant burden for many users. Furthermore, applying ControlNets independently to different frames can not effectively maintain object temporal consistency. To address these challenges, we introduce Ctrl-Adapter, an efficient and versatile framework that adds diverse controls to any image/video diffusion models through the adaptation of pretrained ControlNets. Ctrl-Adapter offers strong and diverse capabilities, including image and video control, sparse-frame video control, fine-grained patch-level multi-condition control, zero-shot adaptation to unseen conditions, and supports a variety of downstream tasks beyond spatial control, including video editing, video style transfer, and text-guided motion control. With six diverse U-Net/DiT-based image/video diffusion models (SDXL, PixArt-α, I2VGen-XL, SVD, Latte, Hotshot-XL), Ctrl-Adapter matches the performance of pretrained ControlNets on COCO and achieves the state-of-the-art on DAVIS 2017 with significantly lower computation (< 10 GPU hours). We provide video examples in https://ctrladapterexamples.github.io and code in the supplementary material.

[openreview] [pdf]

Abstract State-of-the-art link prediction (LP) models demonstrate impressive benchmark results. However, popular benchmark datasets often assume that training, validation, and testing samples are representative of the overall dataset distribution. In real-world situations, this assumption is often incorrect; since uncontrolled factors lead to the problem where new dataset samples come from different distributions than training samples. The vast majority of recent work focuses on dataset shift affecting node- and graph-level tasks, largely ignoring link-level tasks. To bridge this gap, we introduce a novel splitting strategy, known as LPShift, which utilizes structural properties to induce a controlled distribution shift. We verify the effect of LPShift through empirical evaluation of SOTA LP methods on 16 LPShift generated splits of Open Graph Benchmark (OGB) datasets. When benchmarked with LPShift datasets, GNN4LP methods frequently generalize worse than heuristics or basic GNNs. Furthermore, LP-specific generalization techniques do little to improve performance under LPShift. Finally, further analysis provides insight on why LP models lose much of their architectural advantages under LPShift.

[openreview] [pdf]

Abstract Enhancing the capability of large language models (LLMs) in reasoning has gained significant attention in recent years. Previous studies have demonstrated the effectiveness of various prompting strategies in aiding LLMs in reasoning (called “reasoning actions”), such as step-by-step thinking, reflecting before answering, solving with programs, and their combinations. However, these approaches often applied static, predefined reasoning actions uniformly to all questions, without considering the specific characteristics of each question or the capability of the task-solving LLM. In this paper, we propose DOTS, an approach enabling LLMs to reason Dynamically via Optimal reasoning Trajectories Search, tailored to the specific characteristics of each question and the inherent capability of the task-solving LLM. Our approach involves three key steps: i) defining atomic reasoning action modules that can be composed into various reasoning action trajectories; ii) searching for the optimal action trajectory for each training question through iterative exploration and evaluation for the specific task-solving LLM; and iii) using the collected optimal trajectories to train an LLM to plan for the reasoning trajectories of unseen questions. In particular, we propose two learning paradigms, i.e., fine-tuning an external LLM as a planner to guide the task-solving LLM, or directly fine-tuning the task-solving LLM with an internalized capability for reasoning actions planning. Our experiments across eight reasoning tasks show that our method consistently outperforms static reasoning techniques and the vanilla instruction tuning approach. Further analysis reveals that our method enables LLMs to adjust their computation based on problem complexity, allocating deeper thinking and reasoning to harder problems.

427Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning

[openreview] [pdf]

Abstract Autoregressive language models, despite their impressive capabilities, struggle with complex reasoning and long-term planning tasks. We introduce discrete diffusion models as a novel solution to these challenges. Through the lens of subgoal imbalance, we demonstrate how diffusion models effectively learn difficult subgoals that elude autoregressive approaches. We propose Multi-granularity Diffusion Modeling (MDM), which prioritizes subgoals based on difficulty during learning. On complex tasks like Countdown, Sudoku, and Boolean Satisfiability Problems, MDM significantly outperforms autoregressive models without using search techniques. For instance, MDM achieves 91.5% and 100% accuracy on Countdown and Sudoku, respectively, compared to 45.8% and 20.7% for autoregressive models. Our work highlights the potential of diffusion-based approaches in advancing AI capabilities for sophisticated language understanding and problem-solving tasks.

428Training on more Reachable Tasks for Generalisation in Reinforcement Learning

[openreview] [pdf]

Abstract In multi-task reinforcement learning, agents train on a fixed set of tasks and have to generalise to new ones. Recent work has shown that increased exploration improves this generalisation, but it remains unclear why exactly that is. In this paper, we introduce the concept of reachability in multi-task reinforcement learning and show that an initial exploration phase increases the number of reachable tasks the agent is trained on. This, and not the increased exploration, is responsible for the improved generalisation, even to unreachable tasks. Inspired by this, we propose a novel method Explore-Go that implements such an exploration phase at the beginning of each episode. Explore-Go only modifies the way experience is collected and can be used with most existing on-policy or off-policy reinforcement learning algorithms. We demonstrate the effectiveness of our method when combined with some popular algorithms and show an increase in generalisation performance across several environments.
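
A minimal sketch of such an exploration phase for a Gymnasium-style environment; the random prefix length and the reset-on-termination handling are our illustrative choices:

```python
import numpy as np

def collect_episode(env, policy, max_explore=50, rng=None):
    """Explore-Go-style data collection (sketch): a purely exploratory prefix
    of random length precedes the policy rollout. Only collection changes, so
    any on-/off-policy RL algorithm can train on the returned transitions."""
    rng = rng or np.random.default_rng()
    obs, _ = env.reset()
    # Pure-exploration phase: random actions move the agent to varied states
    for _ in range(int(rng.integers(0, max_explore + 1))):
        obs, _, terminated, truncated, _ = env.step(env.action_space.sample())
        if terminated or truncated:
            obs, _ = env.reset()
    # Normal rollout from wherever exploration ended
    transitions, done = [], False
    while not done:
        action = policy(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        transitions.append((obs, action, reward, next_obs))
        obs, done = next_obs, terminated or truncated
    return transitions
```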

429An Online Learning Theory of Trading-Volume Maximization

[openreview] [pdf]

Abstract We explore brokerage between traders in an online learning framework. At any round t, two traders meet to exchange an asset, provided the exchange is mutually beneficial. The broker proposes a trading price, and each trader tries to sell their asset or buy the asset from the other party, depending on whether the price is higher or lower than their private valuations. A trade happens if one trader is willing to sell and the other is willing to buy at the proposed price. Previous work provided guidance to a broker aiming at enhancing traders’ total earnings by maximizing the gain from trade, defined as the sum of the traders’ net utilities after each interaction. This classical notion of reward can be highly unfair to traders with small profit margins, and far from the real-life utility of the broker. For these reasons, we investigate how the broker should behave to maximize the trading volume, i.e., the total number of trades. We model the traders’ valuations as an i.i.d. process with an unknown distribution. If the traders’ valuations are revealed after each interaction (full feedback), and the traders’ valuations cumulative distribution function (cdf) is continuous, we provide an algorithm achieving logarithmic regret and show its optimality up to constants. If only their willingness to sell or buy at the proposed price is revealed after each interaction (2-bit feedback), we provide an algorithm achieving poly-logarithmic regret when the traders’ valuations cdf is Lipschitz and show its near-optimality. We complement our results by analyzing the implications of dropping the regularity assumptions on the unknown traders’ valuations cdf. If we drop the continuous cdf assumption, the regret rate degrades to Θ(√T) in the full-feedback case, where T is the time horizon. If we drop the Lipschitz cdf assumption, learning becomes impossible in the 2-bit feedback case.
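
For intuition on the full-feedback setting: with i.i.d. valuations and a continuous cdf F, a trade at price p occurs with probability 2F(p)(1−F(p)), which is maximized at the median. A toy simulation of the resulting propose-the-empirical-median broker; the Beta valuation distribution is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
vals, trades = [], 0
for t in range(1, 10_001):
    # Propose the empirical median: the per-round trade probability
    # 2F(p)(1-F(p)) peaks at the median of the valuation distribution.
    price = np.median(vals) if vals else 0.5
    v_a, v_b = rng.beta(2, 5, size=2)            # hypothetical i.i.d. valuations
    if min(v_a, v_b) <= price <= max(v_a, v_b):  # one sells, the other buys
        trades += 1
    vals += [v_a, v_b]                           # full feedback: both revealed
print(f"volume after {t} rounds: {trades}")
```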

431Accelerate High-Quality Diffusion Models with Inner Loop Feedback

[openreview] [pdf]

Abstract We propose Inner Loop Feedback (ILF), a novel approach to accelerate diffusion models’ inference. ILF trains a lightweight module to predict future features in the denoising process by leveraging the outputs from a chosen diffusion backbone block at a given time step. This approach exploits two key intuitions: (1) the outputs of a given block at adjacent time steps are similar, and (2) performing partial computations for a step imposes a lower burden on the model than skipping the step entirely. Our method is highly flexible, since we find that the feedback module itself can simply be a block from the diffusion backbone, with all settings copied. Its influence on the diffusion forward pass can be tempered with a learnable scaling factor initialized at zero. We train this module using distillation losses; however, unlike some prior work where a full diffusion backbone serves as the student, our model freezes the backbone, training only the feedback module. While many efforts to optimize diffusion models focus on achieving acceptable image quality in extremely few steps (1-4 steps), our emphasis is on matching best-case results (typically achieved in 20 steps) while significantly reducing runtime. ILF achieves this balance effectively, demonstrating strong performance for both class-to-image generation with diffusion transformers (DiT) and text-to-image generation with the DiT-based PixArt-alpha and PixArt-sigma. The quality of ILF’s 1.7x-1.8x speedups is confirmed by FID, CLIP score, CLIP Image Quality Assessment, ImageReward, and qualitative comparisons.
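
A sketch of the "copied block with a zero-initialized scale" idea; the exact wiring into the denoising loop (which block is tapped and how its prediction substitutes future computation) is not specified in the abstract and is our simplification:

```python
import copy
import torch
import torch.nn as nn

class InnerLoopFeedback(nn.Module):
    """Sketch of ILF's feedback module: a clone of a chosen backbone block with
    all settings copied, gated by a scale initialized at zero so training
    starts from the unmodified diffusion forward pass."""
    def __init__(self, backbone_block):
        super().__init__()
        self.block = copy.deepcopy(backbone_block)  # trainable copy; backbone frozen
        self.scale = nn.Parameter(torch.zeros(1))   # zero-init: no influence at start

    def forward(self, block_output):
        # Predict the block's features at the next timestep from the current ones
        return block_output + self.scale * self.block(block_output)
```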

432Entropy-Based Uncertainty Modeling for Trajectory Prediction in Autonomous Driving

[openreview] [pdf]

Abstract In autonomous driving, accurate motion prediction is essential for safe and efficient motion planning. To ensure safety, planners must rely on reliable uncertainties in the future behavior of surrounding agents, yet this aspect has received limited attention. This paper addresses the problem of uncertainty modeling in trajectory prediction. We adopt a holistic approach that focuses on uncertainty quantification, decomposition, and the influence of model composition. Our method is based on a theoretically grounded, information-theoretic approach to measuring uncertainty, allowing us to decompose total uncertainty into its aleatoric and epistemic components. We conduct extensive experiments on the nuScenes dataset to assess how different model architectures and configurations affect uncertainty quantification and model robustness. Our analysis thoroughly explores the uncertainty quantification capabilities of several state-of-the-art prediction models, examining the relationship between uncertainty and prediction error in both in- and out-of-distribution scenarios, as well as robustness in out-of-distribution settings.
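
The standard information-theoretic decomposition the abstract alludes to, for an ensemble of predictive distributions: total predictive entropy splits into an aleatoric term (mean member entropy) and an epistemic term (their difference, a mutual information). A minimal sketch:

```python
import numpy as np

def decompose_uncertainty(member_probs):
    """Entropy-based uncertainty split for (M, K) member probabilities over
    K candidate trajectories/modes from M ensemble members or MC samples.

    total     = entropy of the averaged predictive distribution
    aleatoric = average entropy of the members (irreducible data noise)
    epistemic = total - aleatoric (mutual information; model uncertainty)
    """
    mean = member_probs.mean(axis=0)
    total = -np.sum(mean * np.log(mean + 1e-12))
    aleatoric = -np.mean(np.sum(member_probs * np.log(member_probs + 1e-12),
                                axis=1))
    return total, aleatoric, total - aleatoric

# e.g. 5 ensemble members over 6 candidate trajectories
probs = np.random.dirichlet(np.ones(6), size=5)
print(decompose_uncertainty(probs))
```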

433Ensembling Diffusion Models via Adaptive Feature Aggregation

[openreview] [pdf]

Abstract The success of the text-guided diffusion model has inspired the development and release of numerous powerful diffusion models within the open-source community. These models are typically fine-tuned on various expert datasets, showcasing diverse denoising capabilities. Leveraging multiple high-quality models to produce stronger generation ability is valuable, but has not been extensively studied. Existing methods primarily adopt parameter merging strategies to produce a new static model. However, they overlook the fact that the divergent denoising capabilities of the models may dynamically change across different states, such as when experiencing different prompts, initial noises, denoising steps, and spatial locations. In this paper, we propose a novel ensembling method, Adaptive Feature Aggregation (AFA), which dynamically adjusts the contributions of multiple models at the feature level according to various states (i.e., prompts, initial noises, denoising steps, and spatial locations), thereby keeping the advantages of multiple diffusion models, while suppressing their disadvantages. Specifically, we design a lightweight Spatial-Aware Block-Wise (SABW) feature aggregator that adaptively aggregates the block-wise intermediate features from multiple U-Net denoisers into a unified one. The core idea lies in dynamically producing an individual attention map for each model’s features by comprehensively considering various states. It is worth noting that only SABW is trainable, with about 50 million parameters, while the other models are frozen. Both the quantitative and qualitative experiments demonstrate the effectiveness of our proposed method.
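
A rough sketch of a spatially aware block-wise aggregator: a tiny trainable head scores each frozen denoiser's block features per spatial location, here conditioned only on a timestep embedding, which simplifies the paper's full state conditioning:

```python
import torch
import torch.nn as nn

class BlockwiseAggregator(nn.Module):
    """Sketch of SABW-style aggregation across M frozen denoisers: produce one
    spatial attention map per model and blend the block features accordingly."""
    def __init__(self, channels, t_dim=128):
        super().__init__()
        self.t_proj = nn.Linear(t_dim, channels)
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats, t_emb):
        # feats: (M, B, C, H, W) block outputs from M U-Nets; t_emb: (B, t_dim)
        cond = self.t_proj(t_emb)[:, :, None, None]          # (B, C, 1, 1)
        logits = torch.stack([self.score(feats[m] + cond)
                              for m in range(feats.shape[0])])
        attn = torch.softmax(logits, dim=0)                  # (M, B, 1, H, W)
        return (attn * feats).sum(dim=0)                     # unified feature map
```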

434Prevalence of Negative Transfer in Continual Reinforcement Learning: Analyses and a Simple Baseline

[openreview] [pdf]

Abstract We argue that the negative transfer problem, which occurs when a new task to learn arrives, is an important issue that must not be overlooked when developing effective Continual Reinforcement Learning (CRL) algorithms. Through comprehensive experimental validation, we demonstrate that this issue frequently exists in CRL and cannot be effectively addressed by several recent works on either mitigating the plasticity loss of RL agents or enhancing positive transfer in the CRL scenario. To that end, we develop Reset & Distill (R&D), a simple yet highly effective baseline method, to overcome the negative transfer problem in CRL. R&D combines a strategy of resetting the agent’s online actor and critic networks to learn a new task with an offline learning step that distills knowledge from the online actor and the previous expert’s action probabilities. We carried out extensive experiments on a long sequence of Meta-World tasks and show that our simple baseline method consistently outperforms recent approaches, achieving significantly higher success rates across a range of tasks. Our findings highlight the importance of considering negative transfer in CRL and emphasize the need for robust strategies like R&D to mitigate its detrimental effects.

435Inverse Flow and Consistency Models

[openreview] [pdf]

Abstract Inverse generation problems, such as denoising without ground truth observations, are a critical challenge in many scientific inquiries and real-world applications. While recent advances in generative models like diffusion models, conditional flow matching, and consistency models have achieved impressive results by casting generation as denoising problems, they cannot be directly used for inverse generation without access to clean data. Here we introduce Inverse Flow (IF), a novel framework that enables using these generative models for inverse generation problems, including denoising without ground truth. Inverse Flow can be flexibly applied to nearly any continuous noise distribution and allows complex dependencies. We propose two algorithms for learning Inverse Flows, Inverse Flow Matching (IFM) and Inverse Consistency Model (ICM). Notably, to derive the computationally efficient, simulation-free inverse consistency model objective, we generalized consistency training to any forward diffusion process or conditional flow, which has applications beyond denoising. We demonstrate the effectiveness of IF on synthetic and real datasets, outperforming prior approaches while enabling noise distributions that previous methods cannot support. Finally, we showcase applications of our techniques to fluorescence microscopy and single-cell genomics data, highlighting IF’s utility in scientific problems. This work opens up the use of powerful generative models for denoising.

436Understanding the Stability-based Generalization of Personalized Federated Learning

[openreview] [pdf]

Abstract Despite great achievements in algorithm design for Personalized Federated Learning (PFL), research on the theoretical analysis of generalization is still in its early stages. Some recent theoretical results have investigated the generalization performance of personalized models under the problem setting and hypotheses in the convex condition, which do not consider the actual iteration performance during non-convex training. To further understand the testing performance from the theoretical perspective, we propose the first algorithm-matter generalization analysis with uniform stability for the typical PFL method Partial Model Personalization on smooth and non-convex objectives. In an attempt to distinguish the shared and personalized errors, we decouple the shared aggregation and the local fine-tuning progress and illustrate the interaction mechanism between the shared and personalized variables. The algorithm-matter generalization bounds analyze the impact of hyperparameters such as learning steps and stepsizes, as well as the communication modes in both Centralized and Decentralized PFL (C-PFL and D-PFL), and conclude that C-PFL generalizes better than D-PFL. Combined with the convergence errors, we then obtain the excess risk analysis and establish a better early stopping point for the optimal population risk of PFL. Promising experiments on the CIFAR dataset also corroborate our theoretical results.

437Dual-Model Defense: Safeguarding Diffusion Models from Membership Inference Attacks through Disjoint Data Splitting

[openreview] [pdf]

Abstract Diffusion models have demonstrated remarkable capabilities in image synthesis, but their recently proven vulnerability to Membership Inference Attacks (MIAs) poses a critical privacy concern. This paper introduces two novel and efficient approaches (DualMD and DistillMD) to protect diffusion models against MIAs while maintaining high utility. Both methods are based on training two separate diffusion models on disjoint subsets of the original dataset. DualMD then employs a private inference pipeline that utilizes both models. This strategy significantly reduces the risk of black-box MIAs by limiting the information any single model contains about individual training samples. The dual models can also generate “soft targets” to train a private student model in DistillMD, enhancing privacy guarantees against all types of MIAs. Extensive evaluations of DualMD and DistillMD against state-of-the-art MIAs across various datasets in white-box and black-box settings demonstrate their effectiveness in substantially reducing MIA success rates while preserving competitive image generation performance. Notably, our experiments reveal that DistillMD not only defends against MIAs but also mitigates model memorization, indicating that both vulnerabilities stem from overfitting and can be addressed simultaneously with our unified approach.

438Constrained Exploitability Descent: Finding Mixed-Strategy Nash Equilibrium by Offline Reinforcement Learning

[openreview] [pdf]

Abstract This paper presents Constrained Exploitability Descent (CED), a novel model-free offline reinforcement learning algorithm for solving adversarial Markov games. CED is a game-theoretic approach combined with policy constraint methods from offline RL. While policy constraints can perturb the optimal pure-strategy solutions in single-agent scenarios, we find this side effect can be mitigated when it comes to solving adversarial games, where the optimal policy can be a mixed-strategy Nash equilibrium. We theoretically prove that, under the uniform coverage assumption on the dataset, CED converges to a stationary point in deterministic two-player zero-sum Markov games. The min-player policy at the stationary point satisfies the necessary condition for making up an exact mixed-strategy Nash equilibrium, even when the offline dataset is fixed and finite. Compared to the model-based method of Exploitability Descent that optimizes the max-player policy, our convergence result no longer relies on the generalized gradient. Experiments in matrix games, a tree-form game, and an infinite-horizon soccer game verify that a single run of CED leads to an optimal min-player policy when the practical offline data guarantees uniform coverage. Besides, CED achieves significantly lower NashConv compared to an existing pessimism-based method and can gradually improve the behavior policy even under non-uniform coverage.

439Attention Is All You Need For Mixture-of-Depths Routing

[openreview] [pdf]

Abstract Advancements in deep learning are driven by training models with increasingly larger numbers of parameters, which in turn heightens the computational demands. To address this issue, Mixture-of-Depths (MoD) models have been proposed to dynamically assign computations only to the most relevant parts of the inputs, thereby enabling the deployment of large-parameter models with high efficiency during inference and training. These MoD models utilize a routing mechanism to determine which tokens should be processed by a layer, or skipped. However, conventional MoD models employ additional network layers specifically for the routing which are difficult to train, and add complexity and deployment overhead to the model. In this paper, we introduce a novel attention-based routing mechanism, A-MoD, that leverages the existing attention map of the preceding layer for routing decisions within the current layer. Compared to standard routing, A-MoD allows for more efficient training as it introduces no additional trainable parameters and can be easily adapted from pretrained transformer models. Furthermore, it can increase the performance of the MoD model. For instance, we observe up to 2% higher accuracy on ImageNet and up to 2× faster transfer learning, for the first time showing the benefits of MoD on various computer vision tasks.
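
A minimal sketch of attention-derived routing: score each token by the attention it received in the preceding layer and keep the top fraction for processing; the averaging scheme is our assumption:

```python
import torch

def a_mod_route(attn_prev, capacity=0.5):
    """Attention-based MoD routing sketch: score each token by the average
    attention it received in the preceding layer and keep the top fraction.

    attn_prev: (B, heads, T, T) attention weights from the previous layer.
    Returns (B, k) indices of tokens the current layer should process.
    """
    # Mean over heads, then over query positions -> attention *received* per token
    importance = attn_prev.mean(dim=1).mean(dim=1)   # (B, T)
    k = max(1, int(capacity * importance.shape[-1]))
    return importance.topk(k, dim=-1).indices

# Tokens outside the returned indices skip the layer via the residual connection.
```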

440Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning

[openreview] [pdf]

Abstract Autonomous agents have demonstrated significant potential in automating complex multistep decision-making tasks. However, even state-of-the-art vision-language models (VLMs), such as GPT-4o, still fall short of human-level performance, particularly in intricate web environments and long-horizon planning tasks. To address these limitations, we introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test-time algorithm designed to enhance the ability of AI agents, e.g., powered by GPT-4o, to explore decision space on the fly. R-MCTS extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi-agent debate to provide reliable state evaluation. Moreover, we improve the agent’s performance by fine-tuning GPT-4o through self-learning, using R-MCTS generated tree traversals without any human-provided labels. On the challenging VisualWebArena benchmark, our GPT-4o-based R-MCTS agent achieves a 6% to 30% relative improvement across various tasks compared to the previous state-of-the-art. Additionally, we show that the knowledge gained from test-time search can be effectively transferred back to GPT-4o via fine-tuning. The fine-tuned GPT-4o matches 97% of R-MCTS’s performance while reducing compute usage by a factor of four at test time. Furthermore, qualitative results reveal that the fine-tuned GPT-4o model demonstrates the ability to explore the environment, evaluate a state, and backtrack to viable ones when it detects that the current state cannot lead to success. Moreover, our work demonstrates the compute scaling properties in both training - data collection with R-MCTS - and testing time. These results suggest a promising research direction to enhance VLMs’ reasoning and planning capabilities for agentic applications via test-time search and self-learning.

441RouteLLM: Learning to Route LLMs from Preference Data

[openreview] [pdf]

Abstract Large language models (LLMs) excel at a wide range of tasks, but choosing the right model often involves balancing performance and cost. Powerful models offer better results but are expensive, while smaller models are more cost-effective but less capable. To address this trade-off, we introduce a training framework for learning efficient router models that dynamically select between a stronger and weaker LLM during inference. Our framework leverages human preference data and employs data augmentation techniques to enhance performance. Evaluations on public benchmarks show that our approach can reduce costs by over 2 times without sacrificing response quality. Moreover, our routers exhibit strong generalization capabilities, maintaining performance even when routing between LLMs not included in training. This highlights the potential of our framework to deliver cost-effective, high-performance LLM solutions.
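
A toy sketch of the routing idea: fit a preference-based classifier predicting whether the weak model's answer suffices, then pick a model by thresholding that probability. The synthetic data and logistic probe stand in for the paper's learned routers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical preference data: prompt embeddings and whether the weak model's
# answer was preferred (1) over the strong model's (0)
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 64))
y = (X[:, 0] + 0.3 * rng.standard_normal(2000) > 0).astype(int)

router = LogisticRegression(max_iter=1000).fit(X, y)

def route(prompt_emb, threshold=0.6):
    """Send to the cheap model only when the router is confident it suffices;
    raising the threshold trades cost savings for response quality."""
    p_weak_ok = router.predict_proba(prompt_emb[None])[0, 1]
    return "weak" if p_weak_ok >= threshold else "strong"

print(route(X[0]))
```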

442B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners

[openreview] [pdf]

Abstract In the absence of extensive human-annotated data for complex reasoning tasks, self-improvement -- where models are trained on their own outputs -- has emerged as a primary method for enhancing performance. Recently, the approach to self-improvement has shifted toward a more dynamic, online fashion through iterative training processes. However, the critical factors underlying the mechanism of these self-improving methods remain poorly understood, such as under what conditions self-improvement is effective, and what are the bottlenecks in the current iterations. In this work, we identify and propose methods to monitor two pivotal factors in this iterative process: (1) the model’s ability to explore and generate high-quality responses among multiple candidates (exploration); and (2) the reliability of external rewards in selecting the best responses from the generated outputs (exploitation). These factors are inherently moving targets throughout the self-improvement cycles, yet their dynamics are rarely discussed in prior research -- It remains unclear what impedes continual model enhancement after only a few iterations. Using mathematical reasoning as a case study, we begin with a quantitative analysis to track the dynamics of exploration and exploitation, discovering that a model’s exploratory capabilities rapidly deteriorate over iterations, and the effectiveness of exploiting external rewards diminishes as well due to shifts in distribution from the original policy. Motivated by these findings, we introduce B-STaR, a Self-Taught Reasoning framework that autonomously adjusts configurations across iterations to Balance exploration and exploitation, thereby optimizing the self-teaching effectiveness based on the current policy model and available rewards. Our experiments in mathematical reasoning demonstrate that B-STaR not only enhances the model’s exploratory capabilities throughout training but also achieves a more effective balance between exploration and exploitation, leading to superior performance. Crucially, this work deconstructs the opaque nature of self-training algorithms, elucidating the interpretable dynamics throughout the process and highlighting current limitations for future research to address.

443Counterfactual Realizability

[openreview] [pdf]

Abstract It is commonly believed that, in a real-world environment, samples can only be drawn from observational and interventional distributions, corresponding to Layers 1 and 2 of the Pearl Causal Hierarchy. Layer 3, representing counterfactual distributions, is believed to be inaccessible almost by definition. However, Bareinboim, Forney, and Pearl (2015) introduced a procedure that allows an agent to sample directly from a counterfactual distribution, leaving open the question of what other counterfactual quantities can be estimated directly via physical experimentation. We resolve this by introducing a formal definition of realizability, the ability to draw samples from a distribution, and then developing a complete algorithm to determine whether an arbitrary counterfactual distribution is realizable given fundamental physical constraints, such as the inability to go back in time and subject the same unit to a different experimental condition. We illustrate the implications of this new framework for counterfactual data collection using motivating examples from causal fairness and causal reinforcement learning. While the baseline approach in these motivating settings typically follows an interventional or observational strategy, we show that a counterfactual strategy provably dominates both.

444How to Evaluate Reward Models for RLHF

[openreview] [pdf]

Abstract Reward models are critical to the LLM fine-tuning pipeline, serving as the proxy reference signal during Reinforcement Learning from Human Feedback (RLHF). As a result, the RLHF-ed model’s success strongly depends on the reward model’s ability to reproduce human preferences with high fidelity. However, this exact dependence is unknown, making it difficult to know which reward model is best. Undergoing a full RLHF training pipeline to directly probe downstream LLM performance, while the gold standard, is completely impractical given the resource-intensive nature of RLHF. To address this, we study downstream RLHF outcomes to create a predictive reward model evaluation. We ground our evaluations with our large-scale human preference and verifiable correctness preference datasets, compiling 12 metrics across 12 domains. To investigate which reward model metrics are most correlated to RLHF outcomes, we launch a full end-to-end RLHF experiment on a large-scale crowdsourced human preference platform to view real reward model downstream performance as ground truth. Ultimately, we compile our data and findings into Preference Proxy Evaluations (PPE), the first reward model benchmark explicitly linked to post-RLHF real-world human preference performance which we will open-source for public use and further development.

445Structured Diffusion Models with Mixture of Gaussians as Prior Distribution

[openreview] [pdf]

Abstract We propose a class of structured diffusion models, in which the prior distribution is chosen as a mixture of Gaussians, rather than a standard Gaussian distribution. The specific mixed Gaussian distribution, as prior, can be chosen to incorporate certain structured information of the data. We develop a simple-to-implement training procedure that smoothly accommodates the use of mixed Gaussian as prior. Theory is provided to quantify the benefits of our proposed models, compared to the classical diffusion models. Numerical experiments with synthetic, image and operational data are conducted to show comparative advantages of our model. Our method is shown to be robust to mis-specifications and in particular suits situations where training resources are limited or faster training in real time is desired.
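
A minimal sketch of the prior change: draw the terminal noise from a mixture of Gaussians chosen to reflect known structure in the data, then run the learned reverse process from it; the two-component example is purely illustrative:

```python
import numpy as np

def sample_mog_prior(n, means, covs, weights, rng=None):
    """Draw terminal noise from a mixture-of-Gaussians prior instead of N(0, I).
    means: (K, d), covs: (K, d, d), weights: (K,) mixture proportions."""
    rng = rng or np.random.default_rng()
    comps = rng.choice(len(weights), size=n, p=weights)
    return np.stack([rng.multivariate_normal(means[k], covs[k]) for k in comps])

# E.g. a two-component prior encoding a known bimodal structure in the data
means = np.array([[-2.0, 0.0], [2.0, 0.0]])
covs = np.stack([np.eye(2), np.eye(2)])
x_T = sample_mog_prior(512, means, covs, weights=np.array([0.5, 0.5]))
# x_T would then seed the learned reverse process in place of standard Gaussian noise.
```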

446DP-SGD for non-decomposable objective functions

[openreview] [pdf]

Abstract Unsupervised pre-training is a common step in developing computer vision models and large language models. In this setting, the absence of labels requires the use of similarity-based loss functions, such as the contrastive loss, that favor minimizing the distance between similar inputs and maximizing the distance between distinct inputs. As privacy concerns mount, training these models using differential privacy has become more important. However, due to how inputs are generated for these losses, one of their undesirable properties is that their L2 sensitivity grows with the batch size. This property is particularly disadvantageous for differentially private training methods, such as DP-SGD. To overcome this issue, we develop a new DP-SGD variant for similarity based loss functions --- in particular, the commonly-used contrastive loss --- that manipulates gradients of the objective function in a novel way to obtain a sensitivity of the summed gradient that is O(1) for batch size n. We test our DP-SGD variant on some CIFAR-10 pre-training and CIFAR-100 finetuning tasks and show that, in both tasks, our method’s performance comes close to that of a non-private model and generally outperforms DP-SGD applied directly to the contrastive loss.
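
For context, a generic per-example clip-and-noise DP-SGD step is sketched below; the paper's actual contribution, a gradient manipulation that keeps the summed-gradient sensitivity O(1) for the non-decomposable contrastive loss, is not reproduced here:

```python
import torch

def dp_sgd_step(params, per_sample_grads, clip_norm=1.0, noise_mult=1.0, lr=0.1):
    """Generic DP-SGD step: per-example clipping plus Gaussian noise.

    per_sample_grads: list of tensors shaped (n, *param_shape), one per param.
    """
    n = per_sample_grads[0].shape[0]
    # Per-example total gradient norm across all parameters
    norms = torch.sqrt(sum(g.flatten(1).pow(2).sum(1) for g in per_sample_grads))
    scale = (clip_norm / (norms + 1e-6)).clamp(max=1.0)
    for p, g in zip(params, per_sample_grads):
        clipped = (g * scale.view(-1, *([1] * (g.dim() - 1)))).sum(0)
        noisy = clipped + noise_mult * clip_norm * torch.randn_like(clipped)
        p.data -= lr * noisy / n
```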

447SATE: A Two-Stage Approach for Performance Prediction in Subpopulation Shift Scenarios

[openreview] [pdf]

Abstract Subpopulation shift refers to the difference in the distribution of subgroups between training and test datasets. When an underrepresented group becomes predominant during testing, it can lead to significant performance degradation, making performance prediction prior to deployment particularly important. Existing performance prediction methods often fail to address this type of shift effectively due to their usage of unreliable model confidence and mis-specified distributional distances. In this paper, we propose a novel performance prediction method specifically designed to tackle subpopulation shifts, called Subpopulation-Aware Two-stage Estimator (SATE). Our approach first estimates the subgroup proportions in the test set by linearly expressing the test embedding with training subgroup embeddings. Then, it predicts the accuracy for each subgroup using the accuracy on augmented training set, aggregating them into an overall performance estimate. We provide theoretical proof of our method’s unbiasedness and consistency, and demonstrate that it outperforms numerous baselines across various datasets, including vision, medical, and language tasks, offering a reliable tool for performance prediction in scenarios involving subpopulation shifts.
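
A hedged sketch of the two stages, assuming subgroup mean embeddings and per-subgroup accuracy estimates are available; nonnegative least squares stands in for the paper's exact estimator:

```python
import numpy as np
from scipy.optimize import nnls

def estimate_performance(test_emb, group_embs, group_accs):
    """SATE-style two-stage sketch: (1) express the mean test embedding as a
    nonnegative combination of training subgroup mean embeddings to estimate
    test subgroup proportions; (2) aggregate per-subgroup accuracy estimates."""
    A = np.stack(group_embs, axis=1)          # (d, G) subgroup mean embeddings
    b = test_emb.mean(axis=0)                 # (d,) mean test embedding
    w, _ = nnls(A, b)                         # nonnegative least squares
    w = w / w.sum()                           # normalize to proportions
    return float(w @ np.asarray(group_accs))  # predicted overall accuracy
```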

448Does Spatial Cognition Emerge in Frontier Models?

[openreview] [pdf]

Abstract Not yet. We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models. Our benchmark builds on decades of research in cognitive science. It evaluates large-scale mapping abilities that are brought to bear when an organism traverses physical environments, smaller-scale reasoning about object shapes and layouts, and cognitive infrastructure such as spatial attention and memory. For many tasks, we instantiate parallel presentations via text and images, allowing us to benchmark both large language models and large multimodal models. Results suggest that contemporary frontier models fall short of the spatial intelligence of animals, performing near chance level on a number of classic tests of animal cognition.

449Distribution-Dependent Rates for Multi-Distribution Learning

[openreview] [pdf]

Abstract To address the needs of modeling uncertainty in sensitive machine learning applications, the setup of distributionally robust optimization (DRO) seeks good performance uniformly across a variety of tasks. The recent multi-distribution learning (MDL) framework (Awasthi et al., 2023) tackles this objective in a dynamic interaction with the environment, where the learner has sampling access to each target distribution. Drawing inspiration from the field of pure-exploration multi-armed bandits, we provide distribution-dependent guarantees in the MDL regime that scale with suboptimality gaps and result in superior dependence on the sample size when compared to the existing distribution-independent analyses. We investigate two non-adaptive strategies, uniform and non-uniform exploration, and present non-asymptotic regret bounds using novel tools from empirical process theory. Furthermore, we devise an adaptive optimistic algorithm, LCB-DR, that showcases enhanced dependence on the gaps, mirroring the contrast between uniform and optimistic allocation in the multi-armed bandit literature.

450Local Patterns Generalize Better for Novel Anomalies

[openreview] [pdf]

Abstract Video anomaly detection (VAD) aims at identifying novel actions or events which are unseen during training. Existing mainstream VAD techniques typically focus on the global patterns of events but struggle to generalize to novel samples. In this paper, we propose a framework that identifies the local patterns which generalize to novel samples and models the dynamics of local patterns. The capability of extracting spatial local patterns is achieved through a two-stage process involving image-text alignment and cross-modality attention. Generalizable representations are built by focusing on text-informative features that filter out unnecessary visual data variances. To enhance spatial local patterns with temporal clues, we introduce a State Machine Module (SMM) that combines tokens from different moments to improve sentence generation within cross-modality attention. Furthermore, temporal motion estimation complements spatial local patterns to detect anomalies characterized by novel spatial distributions or distinctive dynamics. Extensive experiments on popular benchmark datasets demonstrate the achievement of state-of-the-art performance. Code is available at https://anonymous.4open.science/r/Local-Patterns-Generalize-Better-1E30/.

451State & Image Guidance: Teaching Old Text-to-Video Diffusion Models New Tricks

[openreview] [pdf]

Abstract Current text-to-video (T2V) models have made significant progress in generating high-quality video. However, these models are limited when it comes to generating dynamic video scenes where the description per frame can vary dramatically. Changing the color, shape, position and state of objects in the scene is a challenge that current video models cannot handle. In addition, the lack of a cheap image-based conditioning mechanism limits their creative application. To address these challenges and extend the applicability of T2V models, we propose two innovative approaches: State Guidance and Image Guidance. State Guidance uses advanced guidance mechanisms to control motion dynamics and scene transformation smoothness by navigating the diffusion process between a state triplet <initial state, transition state, final state>. This mechanism enables the generation of dynamic video scenes (Dynamic Scene T2V) and allows control of the speed and the expressiveness of the scene transformation by introducing temporal dynamics via a guidance weight schedule across video frames. Image Guidance enables Zero-Shot Image-to-Video generation (Zero-Shot I2V) by injecting a reference image into the noise predictions of the initial diffusion steps. Furthermore, the combination of State Guidance and Image Guidance allows for zero-shot transitions between two input reference frames of a video (Zero-Shot II2V). Finally, we introduce the novel Dynamic Scene Benchmark to evaluate the ability of the models to generate dynamic video scenes. Extensive experiments show that State Guidance and Image Guidance successfully address the aforementioned challenges and significantly improve the generation capabilities of existing T2V architectures.

452UNIQ: Offline Inverse Q-learning for Avoiding Undesirable Demonstrations

[openreview] [pdf]

Abstract We address the problem of offline learning a policy that avoids undesirable demonstrations. Unlike conventional offline imitation learning approaches that aim to imitate expert or near-optimal demonstrations, our setting involves avoiding undesirable behavior (specified using undesirable demonstrations). To tackle this problem, unlike standard imitation learning where the aim is to minimize the distance between learning policy and expert demonstrations, we formulate the learning task as maximizing a statistical distance, in the space of state-action stationary distributions, between the learning policy and the undesirable policy. This significantly different approach results in a novel training objective that necessitates a new algorithm to address it. Our algorithm, UNIQ, tackles these challenges by building on the inverse Q-learning framework, framing the learning problem as a cooperative (non-adversarial) task. We then demonstrate how to efficiently leverage unlabeled data for practical training. Our method is evaluated on standard benchmark environments, where it consistently outperforms state-of-the-art baselines.

453The Directionality of Optimization Trajectories in Neural Networks

[openreview] [pdf]

Abstract The regularity or implicit bias in neural network optimization has been typically studied via the parameter norms or the landscape curvature, often overlooking the trajectory leading to these parameters. However, properties of the trajectory --- particularly its directionality --- capture critical aspects of how gradient descent navigates the landscape to converge to a solution. In this work, we introduce the notion of a Trajectory Map and derive natural complexity measures that highlight the directional characteristics of optimization trajectories. Our comprehensive analysis across vision and language modeling tasks reveals that (a) the trajectory’s directionality at the macro-level saturates by the initial phase of training, wherein weight decay and momentum play a crucial but understated role; and (b) in subsequent training, trajectory directionality manifests in micro-level behaviors, such as oscillations, for which we also provide a theoretical analysis. This implies that neural optimization trajectories have, overall, a more linear form than a zig-zag one, as evidenced by high directional similarity, especially towards the end. To further hone this point, we show that when the trajectory direction gathers such an inertia, optimization proceeds largely unaltered even if the network is severely decapacitated (by freezing >99% of the parameters) --- thereby demonstrating the potential for significant computational and resource savings without compromising performance.
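
One concrete way to quantify such directionality, in the spirit of the paper's Trajectory Map (the authors' exact complexity measures may differ), is the mean cosine similarity between successive parameter updates:

```python
import numpy as np

def directional_similarity(checkpoints):
    """Mean cosine similarity between successive parameter updates along a
    training trajectory. checkpoints: list of 1-D arrays of flattened
    parameters, in training order."""
    updates = [b - a for a, b in zip(checkpoints, checkpoints[1:])]
    sims = [u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
            for u, v in zip(updates, updates[1:])]
    return float(np.mean(sims))  # near 1.0 => nearly linear path, near 0 => zig-zag
```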

454TWO STAGES DOMAIN INVARIANT REPRESENTATION LEARNERS SOLVE THE LARGE CO-VARIATE SHIFT IN UNSUPERVISED DOMAIN ADAPTATION WITH TWO DIMENSIONAL DATA DOMAINS

[openreview] [pdf]

Abstract Recent developments in unsupervised domain adaptation (UDA) enable unsupervised machine learning (ML) prediction for target data, which will accelerate real-world applications of ML models such as image recognition in self-driving. Researchers have reported that UDA techniques do not work well under large co-variate shift problems, where e.g. the supervised source data consists of handwritten digits in monotone color while the unsupervised target data consists of colored digits from street views. Thus there is a need for a method that resolves the co-variate shift and transfers source labelling rules under these dynamics. We perform two-stage domain invariant representation learning to bridge the gap between source and target with semantic intermediate data (unsupervised). The proposed method learns domain invariant features simultaneously between source and intermediate and between intermediate and target. This finally achieves good domain invariant representation between source and target, plus task discriminability owing to source labels. This induction on the gradient descent search greatly eases learning convergence in terms of classification performance for target data, even under large co-variate shift. We also derive a theorem for measuring the gap between trained models and unsupervised target labelling rules, which is necessary for optimizing the free parameters. Finally we demonstrate that the proposed method is superior to previous UDA methods on 4 representative ML classification datasets comprising 38 UDA tasks. Our experiments will serve as a basis for challenging UDA problems with large co-variate shift.

455Dual-Branch HNSW Approach with Skip Bridges and LID-Driven Optimization

[openreview] [pdf]

Abstract The Hierarchical Navigable Small World (HNSW) algorithm is widely used for approximate nearest neighbor (ANN) search, leveraging the principles of navigable small-world graphs. However, it faces some limitations. The first is the local optima problem, which arises from the algorithm’s greedy search strategy, selecting neighbors based solely on proximity at each step. This often leads to cluster disconnections. The second limitation is that HNSW frequently fails to achieve logarithmic complexity, particularly in high-dimensional datasets, due to the exhaustive traversal through each layer. To address these limitations, we propose a novel algorithm that mitigates local optima and cluster disconnections while improving inference speed. The first component is a dual-branch HNSW structure with LID-based insertion mechanisms, enabling traversal from multiple directions. This improves outlier node capture, enhances cluster connectivity, and reduces the risk of local minima. The second component introduces a bridge-building technique that adds shortcuts between layers, enabling direct jumps and speeding up inference. Experiments on various benchmarks and datasets showed that our algorithm outperforms the original HNSW in both accuracy and speed. We evaluated six datasets across Computer Vision (CV), deep learning (DL), and Natural Language Processing (NLP), showing improvements of 2.5% in NLP, 15% in DL, and up to 35% in CV tasks. Inference speed is also improved by 12% across all datasets. Ablation studies revealed that LID-based insertion had the greatest impact on performance, followed by the dual-branch structure and bridge-building components.

456Toward Exploratory Inverse Constraint Inference with Generative Diffusion Verifiers

[openreview] [pdf]

Abstract An important prerequisite for safe control is aligning the policy with the underlying constraints in the environment. In many real-world applications, due to the difficulty of manually specifying these constraints, existing works have proposed recovering constraints from expert demonstrations by solving the Inverse Constraint Learning (ICL) problem. However, ICL is inherently ill-posed, as multiple constraints can equivalently explain the experts’ preferences, making the optimal solutions not uniquely identifiable. In this work, instead of focusing solely on a single constraint, we propose the novel approach of Exploratory ICL (ExICL). The goal of ExICL is to recover a diverse set of feasible constraints, thereby providing practitioners the flexibility to select the most appropriate constraint based on the needs of practical deployment. To achieve this goal, we design a generative diffusion verifier, which guides the trajectory generation process using the probabilistic representation of an optimal constrained policy. By comparing these decisions with those made by expert agents, we can efficiently verify a candidate constraint. Driven by the verification feedback, ExICL implements an exploratory constraint update mechanism that strategically facilitates the diversity within the collection of feasible constraints. Our empirical results demonstrate that ExICL can seamlessly and reliably generalize across different tasks and environments.

457DRIVE: Distributional Model-Based Reinforcement Learning via Variational Inference

[openreview] [pdf]

Abstract Distributional reinforcement learning (RL) provides a natural framework for estimating the distribution of returns rather than a single expected value. However, the control aspect of distributional RL has not been as thoroughly explored as the evaluation part, typically relying on the greedy selection rule with respect to either the expected value, akin to standard approaches, or risk-sensitive measures derived from the return distribution. On the other hand, casting RL as a probabilistic inference problem allows for flexible control solutions utilizing a toolbox of approximate inference techniques; however, its connection to distributional RL remains underexplored. In this paper, we bridge this gap by proposing a variational approach for efficient policy search. Our method leverages the log-likelihood of optimality as a learning proxy, decoupling it from traditional value functions. This learning proxy incorporates aleatoric uncertainty of the return distribution, enabling risk-aware decision-making. We provide a theoretical analysis of our framework, detailing the conditions for convergence. Empirical results on vision-based tasks in DMControl Suite demonstrate the effectiveness of our approach compared to various algorithms, as well as its ability to balance exploration and exploitation at different training stages.

458Learn from the Past: Dynamic Data Pruning with Historically Weighted Bernoulli Sampling

[openreview] [pdf]

Abstract Dynamic data pruning, which is also known as data importance sampling, has been proposed to improve training efficiency. For the case of sampling with replacement, the optimal sampling distribution to minimize the variance is to sample proportionally to the gradient norm, which can be approximated by the gradient norm of the logits from an extra forward pass. However, this could result in repeated samples, which can be an undesirable property. Noticing that most dynamic data pruning methods that avoid repeated samples can be seen as weighted Bernoulli sampling, in this work we study the optimal distribution to reduce its variance. Furthermore, to avoid an extra forward pass, we study the use of historic statistics. We propose the use of an exponential moving average and probability smoothing to improve performance.
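
A hedged sketch of the selection loop described here, with EMA smoothing of historic scores and probability smoothing; the simple normalization below is a stand-in for the paper's derived optimal inclusion probabilities:

```python
import numpy as np

def bernoulli_select(scores, ema, budget, beta=0.9, eps=0.05):
    """Weighted Bernoulli sampling for the next epoch: update the EMA of
    per-sample importance scores, convert them to inclusion probabilities with
    smoothing, and draw an independent keep/drop decision per sample."""
    ema[:] = beta * ema + (1 - beta) * np.asarray(scores)
    p = ema / ema.sum() * budget                     # target expected subset size
    p = (1 - eps) * p + eps * budget / len(ema)      # smoothing: no sample starves
    p = np.clip(p, 0.0, 1.0)
    keep = np.random.rand(len(p)) < p
    return np.nonzero(keep)[0], p                    # indices + probs for debiasing
```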

459A Simple Baseline for Predicting Future Events with Auto-Regressive Tabular Transformers

[openreview] [pdf]

Abstract Many real-world applications of tabular data involve using historic events to predict properties of new ones, for example whether a credit card transaction is fraudulent or what rating a customer will assign a product on a retail platform. Existing approaches to event prediction include costly, brittle, and application-dependent techniques such as time-aware positional embeddings, learned row and field encodings, and oversampling methods for addressing class imbalance. Moreover, these approaches often assume specific use-cases, for example that we know the labels of all historic events or that we only predict a pre-specified label and not the data’s features themselves. In this work, we propose a simple but flexible baseline using standard autoregressive LLM-style transformers with elementary positional embeddings and a causal language modeling objective. Our baseline outperforms existing approaches across popular datasets and can be employed for various use-cases. We demonstrate that the same model can predict labels, impute missing values, or model event sequences.
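
A possible serialization for such a baseline, shown only to illustrate the input format; the separators, field order, and end-of-event marker are assumptions:

```python
def serialize_events(rows, field_order):
    """Serialize historic tabular events as 'field = value ;' token runs so a
    standard causal LM can predict labels, impute fields, or continue sequences."""
    tokens = []
    for row in rows:
        for f in field_order:
            tokens += [f, "=", str(row[f]), ";"]
        tokens.append("<eot>")                       # end-of-event marker
    return tokens

# e.g. serialize_events([{"amount": 12.5, "label": "fraud"}], ["amount", "label"])
```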

460Learning Policy Committees for Effective Personalization in MDPs with Diverse Tasks

[openreview] [pdf]

Abstract Many dynamic decision problems, such as robotic control, involve a series of tasks, many of which are unknown at training time. Typical approaches for these problems, such as multi-task and meta reinforcement learning, do not generalize well when the tasks are diverse. We propose a general framework to address this issue. In our framework, the goal is to learn a set of policies—a policy committee—such that at least one is near-optimal for most tasks that may be encountered at execution time. While we show that even a special case of this problem is inapproximable, we present two effective algorithmic approaches for it. The first of these yields provable approximation guarantees, albeit in small-dimensional settings (the best we can do due to inapproximability), whereas the second is a general and practical gradient-based approach. In addition, we provide provable sample complexity bounds for few-shot learning settings. Our experiments in personalized and multi-task RL settings using MuJoCo and Meta-World benchmarks show that the proposed approach outperforms state-of-the-art multi-task, meta-, and personalized RL baselines on training and test tasks, as well as in few-shot learning, often by a large margin.

461The Phase Transition Phenomenon of Shuffled Regression

[openreview] [pdf]

Abstract We study the phase transition phenomenon inherent in the shuffled (permuted) regression problem, which has found numerous applications in databases, privacy, data analysis, etc. For the permuted regression task: Y = ΠXB, the goal is to recover the permutation matrix Π as well as the coefficient matrix B. It has been empirically observed in prior studies that when recovering Π, there exists a phase transition phenomenon: the error rate drops to zero rapidly once the parameters reach certain thresholds. In this study, we aim to precisely identify the locations of the phase transition points by leveraging techniques from message passing (MP). In our analysis, we first transform the permutation recovery problem into a probabilistic graphical model. Then, we leverage the analytical tools rooted in the message passing (MP) algorithm and derive an equation to track the convergence of the MP algorithm. By linking this equation to the branching random walk process, we are able to characterize the impact of the signal-to-noise ratio (snr) on the permutation recovery. Depending on whether the signal is given or not, we separately investigate the oracle case and the non-oracle case. The bottleneck in identifying the phase transition regimes lies in deriving closed-form formulas for the corresponding critical points, but only in rare scenarios can one obtain such precise expressions. To tackle this challenge, we propose the Gaussian approximation method, which allows us to obtain the closed-form formulas in almost all scenarios. In the oracle case, our method can fairly accurately predict the phase transition snr. In the non-oracle case, our proposed algorithm can predict the maximum allowed number of permuted rows and uncover its dependency on the sample number.

462Group Distributionally Robust Dataset Distillation with Risk Minimization

[openreview] [pdf]

Abstract Dataset distillation (DD) has emerged as a widely adopted technique for crafting a synthetic dataset that captures the essential information of a training dataset, facilitating the training of accurate neural models. Its applications span various domains, including transfer learning, federated learning, and neural architecture search. The most popular methods for constructing the synthetic data rely on matching the convergence properties of training the model with the synthetic dataset and the training dataset. However, targeting the training dataset must be thought of as auxiliary in the same sense that the training set is an approximate substitute for the population distribution, and the latter is the data of interest. Yet despite its popularity, an aspect that remains unexplored is the relationship of DD to its generalization, particularly across uncommon subgroups. That is, how can we ensure that a model trained on the synthetic dataset performs well when faced with samples from regions with low population density? Here, the representativeness and coverage of the dataset become salient over the guaranteed training error at inference. Drawing inspiration from distributionally robust optimization, we introduce an algorithm that combines clustering with the minimization of a risk measure on the loss to conduct DD. We provide a theoretical rationale for our approach and demonstrate its effective generalization and robustness across subgroups through numerical experiments.

463Learning the Partially Dynamic Travelling Salesman Problem

[openreview] [pdf]

Abstract Learning to solve the Travelling Salesman Problem (TSP) using Deep Reinforcement Learning (Deep RL) and Graph Neural Networks (GNNs) has shown promising results for small instances of the problem. We demonstrate that these methods can be extended to solve instances of a partially dynamic variant of the TSP. Solving this partially dynamic variant more effectively exploits the strengths of reinforcement learning and also presents challenges for more established methods of solving the TSP. We show the policies trained using Deep RL outperform modified versions of TSP solvers and heuristics for different distributions of dynamic vertices, including on larger instances than the policies were trained on. This shows the promise of Deep RL for solving this type of dynamic routing problem which is predicted to become of great importance as logistical services become more flexible and responsive to customer demand. Furthermore, our method is a general purpose approach to Deep RL where the problem consists of selecting items from a dynamically-evolving and arbitrarily-sized set.

464Value Residual Learning For Alleviating Attention Concentration In Transformers

[openreview] [pdf]

Abstract Transformers can capture long-range dependencies using self-attention, allowing tokens to attend to all others directly. However, stacking multiple attention layers leads to attention concentration. One natural way to address this issue is to use cross-layer attention, allowing information from earlier layers to be directly accessible to later layers. However, this approach is computationally expensive. To address this problem, we propose Transformer with residual value (ResFormer) which approximates cross-layer attention through adding a residual connection from the values of the first layer to all subsequent layers. Based on this method, one variant is the Transformer with single layer value (SVFormer), where all layers share the same value embedding from the first layer, reducing the KV cache by nearly 50%. Comprehensive empirical evidence demonstrates that ResFormer mitigates the attention concentration problem in deeper layers and enhances representation across most layers, outperforming the vanilla Transformer, DenseFormer, and NeuTRENO in training error as well as downstream tasks. SVFormer trains significantly faster than the vanilla Transformer and performs better than other methods like GQA and CLA, with performance influenced by sequence length and cumulative learning rate.
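
The value-residual idea reduces to a one-line change inside attention. A single-head sketch follows, with projections simplified and masking and multi-head logic elided (the paper uses standard multi-head blocks):

```python
import torch
import torch.nn.functional as F

class ValueResidualAttention(torch.nn.Module):
    """Single-head attention with a residual connection from the first layer's
    values: every later layer attends over V_l + V_1."""
    def __init__(self, dim):
        super().__init__()
        self.q = torch.nn.Linear(dim, dim)
        self.k = torch.nn.Linear(dim, dim)
        self.v = torch.nn.Linear(dim, dim)

    def forward(self, x, v1=None):
        q, k, v = self.q(x), self.k(x), self.v(x)
        if v1 is None:
            v1 = v                                   # first layer: record V_1
        else:
            v = v + v1                               # later layers: value residual
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v, v1                          # thread V_1 through the stack

# usage: x, v1 = torch.randn(2, 16, 64), None
# for layer in [ValueResidualAttention(64) for _ in range(4)]:
#     x, v1 = layer(x, v1)
```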

465SELF-EVOLVED REWARD LEARNING FOR LLMS

[openreview] [pdf]

Abstract Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences and is a key factor in the success of modern conversational models like GPT-4, ChatGPT, and Llama 2. A significant challenge in employing RLHF lies in training a reliable reward model (RM), which relies on high-quality labels. Typically, these labels are provided by human experts or a stronger AI, both of which can be costly and introduce bias that may affect the language model’s responses. As models improve, human input may become less effective in enhancing their performance. This paper explores the potential of using the RM itself to generate additional training data for a more robust RM. Our experiments demonstrate that reinforcement learning from self-feedback outperforms baseline approaches. We conducted extensive experiments with our approach on multiple datasets, such as HH-RLHF and UltraFeedback, and models including Mistral and Llama 3, comparing it against various baselines. Our results indicate that, even with a limited amount of human-labeled data, learning from self-feedback can robustly enhance the performance of the RM, thereby improving the capabilities of large language models.

466DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback

[openreview] [pdf]

Abstract Restless multi-armed bandits (RMAB) have been widely used to model constrained sequential decision making problems, where the state of each restless arm evolves according to a Markov chain and each state transition generates a scalar reward. However, the success of RMAB crucially relies on the availability and quality of reward signals. Unfortunately, specifying an exact reward function in practice can be challenging and even infeasible. In this paper, we introduce Pref-RMAB, a new RMAB model in the presence of preference signals, where the decision maker only observes pairwise preference feedback rather than scalar reward from the activated arms at each decision epoch. Preference feedback, however, arguably contains less information than the scalar reward, which makes Pref-RMAB seemingly more difficult. To address this challenge, we present a direct online preference learning (DOPL) algorithm for Pref-RMAB to efficiently explore the unknown environments, adaptively collect preference data in an online manner, and directly leverage the preference feedback for decision-making. We prove that DOPL yields a sublinear regret. To the best of our knowledge, this is the first algorithm to ensure Õ(√(T ln T)) regret for RMAB with preference feedback. Experimental results further demonstrate the effectiveness of DOPL.

467Off-Policy Maximum Entropy RL with Visitation Measures

[openreview] [pdf]

Abstract We introduce a new maximum entropy reinforcement learning framework based on the distribution of states and actions visited by a policy. More precisely, an intrinsic reward function is added to the reward function of the Markov decision process that shall be controlled. For each state and action, this intrinsic reward is the relative entropy of the discounted distribution of states and actions (or features from these states and actions) during the next time steps. We prove that this distribution is the fixed point of a contractive operator. Furthermore, the problem of maximizing the expected discounted sum of these intrinsic rewards is proven to be an approximation of the minimization of an upper bound on the suboptimality gap of the state-action value function of the policy. We finally describe how existing algorithms can integrate these intrinsic rewards to enhance exploration and introduce a practical algorithm for learning this fixed point off-policy, using state-action transitions, relying on N-step bootstrapping of the operator. Empirically, this maximum entropy reinforcement learning framework provides exploration policies with good coverage of the state-action space, and high-performing control policies, which both can be computed off-policy.

468How Discrete and Continuous Diffusion Meet: Comprehensive Analysis of Discrete Diffusion Models via a Stochastic Integral Framework

[openreview] [pdf]

Abstract Discrete diffusion models have gained increasing attention for their ability to model complex distributions with tractable sampling and inference. However, the error analysis for discrete diffusion models remains less well-understood. In this work, we propose a comprehensive framework for the error analysis of discrete diffusion models based on Lévy-type stochastic integrals. By generalizing the Poisson random measure to that with a time-independent and state-dependent intensity, we rigorously establish a stochastic integral formulation of discrete diffusion models and provide the corresponding change of measure theorems that are intriguingly analogous to Itô integrals and Girsanov’s theorem for their continuous counterparts. Our framework unifies and strengthens the current theoretical results on discrete diffusion models and obtains the first error bound for the τ-leaping scheme in KL divergence. With error sources clearly identified, our analysis gives new insight into the mathematical properties of discrete diffusion models and offers guidance for the design of efficient and accurate algorithms for real-world discrete diffusion model applications.

469Only-IF: Revealing the Decisive Effect of Instruction Diversity on Generalization

[openreview] [pdf]

Abstract Understanding and accurately following instructions is critical for large language models (LLMs) to be effective across diverse tasks. In this work, we conduct a rigorous investigation into the factors that enable generalization to unseen instructions. Through controlled experiments, inspired by the Turing-complete Markov algorithm, we demonstrate that such generalization only emerges when training data is diversified enough across semantic domains. Our findings also reveal that merely diversifying within limited domains fails to ensure robust generalization. In contrast, cross-domain data diversification, even under constrained data budgets, significantly enhances a model’s adaptability. We further extend our analysis to real-world scenarios, including fine-tuning of specialist and generalist models. Our research provides important insights for dataset collation, particularly when optimizing model performance by expanding training data for both specialist and generalist scenarios. We show that careful consideration of data diversification is key: training specialist models with data extending beyond their core domain leads to significant performance improvements, while generalist models benefit from diverse data mixtures that enhance their overall instruction-following capabilities across a wide range of applications. Our results highlight the critical role of strategic diversification and offer clear guidelines for improving data quality.

470Learning Interpretable and Influential Directions with Signal Vectors and Uncertainty Region Alignment

[openreview] [pdf]

Abstract Latent space directions have played a key role in understanding, debugging, and fixing deep learning models. Concepts are often encoded in distinct feature space directions, and evaluating impact of these directions on the model’s predictions, highlights their importance in the decision-making process. Additionally, recent studies have shown that penalizing directions associated with spurious artifacts during training can force models to unlearn features irrelevant to their prediction task. Identifying these directions, therefore, provides numerous benefits, including a deeper understanding of the model’s strategy, fostering trust, and enabling model correction and improvement. We introduce a novel unsupervised approach utilizing signal vectors and uncertainty region alignment to discover latent space directions that meet two key debugging criteria: significant influence on model predictions and high level of interpretability. To our knowledge, this method is the first of its kind to uncover such directions, leveraging the inherent structure of the feature space and the knowledge encoded in the deep network. We validate our approach using both synthetic and real-world benchmarks, demonstrating that the discovered directions effectively fulfill the critical debugging criteria.

471Unlocking Guidance for Discrete State-Space Diffusion and Flow Models

[openreview] [pdf]

Abstract Generative models on discrete state-spaces have a wide range of potential applications, particularly in the domain of natural sciences. In continuous state-spaces, controllable and flexible generation of samples with desired properties has been realized using guidance on diffusion and flow models. However, these guidance approaches are not readily amenable to discrete state-space models. Consequently, we introduce a general and principled method for applying guidance on such models. Our method depends on leveraging continuous-time Markov processes on discrete state-spaces, which unlocks computational tractability for sampling from a desired guided distribution. We demonstrate the utility of our approach, Discrete Guidance, on a range of applications including guided generation of small-molecules, DNA sequences and protein sequences.

472Learning and Steering Game Dynamics Towards Desirable Outcomes

[openreview] [pdf]

Abstract Game dynamics, which describe how agents’ strategies evolve over time based on past interactions, can exhibit a variety of undesirable behaviours, including convergence to suboptimal equilibria, cycling, and chaos. While central planners can employ incentives to mitigate such behaviors and steer game dynamics towards desirable outcomes, the effectiveness of such interventions critically relies on accurately predicting agents’ responses to these incentives---a task made particularly challenging when the underlying dynamics are unknown and observations are limited. To address this challenge, this work introduces the Side Information Assisted Regression with Model Predictive Control (SIAR-MPC) framework. We extend the recently introduced SIAR method to incorporate the effect of control, enabling it to utilize side-information constraints inherent to game theoretic applications to model agent responses to incentives from scarce data. MPC then leverages this model to implement adaptive incentive adjustments. Our experiments demonstrate the efficiency of SIAR-MPC in guiding systems towards socially optimal equilibria, stabilizing chaotic and cycling behaviors. Comparative analyses in data-scarce settings show SIAR-MPC’s superior performance compared to pairing MPC with state-of-the-art alternatives like Sparse Identification of Nonlinear Dynamics (SINDy) and Physics Informed Neural Networks (PINNs).

473Nesterov acceleration in benignly non-convex landscapes

[openreview] [pdf]

Abstract While momentum-based optimization algorithms are commonly used in the notoriously non-convex optimization problems of deep learning, their analysis has historically been restricted to the convex and strongly convex setting. In this article, we partially close this gap between theory and practice and demonstrate that virtually identical guarantees can be obtained in optimization problems with a 'benign' non-convexity. We show that these weaker geometric assumptions are well justified in overparametrized deep learning, at least locally. Variations of this result are obtained for a continuous time model of Nesterov’s accelerated gradient descent algorithm (NAG), the classical discrete time version of NAG, and versions of NAG with stochastic gradient estimates with purely additive noise and with noise that exhibits both additive and multiplicative scaling.

474Toward Efficient Multi-Agent Exploration With Trajectory Entropy Maximization

[openreview] [pdf]

Abstract Recent works have increasingly focused on learning decentralized policies for agents as a solution to the scalability challenges in Multi-Agent Reinforcement Learning (MARL), where agents typically share the parameters of a policy network to make action decisions. However, this parameter sharing can impede efficient exploration, as it may lead to similar behaviors among agents. Different from previous mutual information-based methods that promote multi-agent diversity, we introduce a novel multi-agent exploration method called Trajectory Entropy Exploration (TEE). Our method employs a particle-based entropy estimator to maximize the entropy of different agents’ trajectories in a contrastive trajectory representation space, resulting in diverse trajectories and efficient exploration. This entropy estimator avoids challenging density modeling and scales effectively in high-dimensional multi-agent settings. We integrate our method with MARL algorithms by deploying an intrinsic reward for each agent to encourage entropy maximization. To validate the effectiveness of our method, we test our method in challenging multi-agent tasks from several MARL benchmarks. The results demonstrate that our method consistently outperforms existing state-of-the-art methods.

475Probing the Latent Hierarchical Structure of Data via Diffusion Models

[openreview] [pdf]

Abstract High-dimensional data must be highly structured to be learnable. Although the compositional and hierarchical nature of data is often put forward to explain learnability, quantitative measurements establishing these properties are scarce. Likewise, accessing the latent variables underlying such a data structure remains a challenge. Forward-backward experiments in diffusion-based models, where a datum is noised and then denoised, are a promising tool to achieve these goals. We predict in simple hierarchical models that, in this process, changes in data occur by correlated chunks, with a length scale that diverges at a noise level where a phase transition is known to take place. Remarkably, we confirm this prediction in both text and image datasets using state-of-the-art diffusion models. Our results suggest that forward-backward experiments are informative on the nature of latent variables, and that the effect of changing deeper ones is revealed near the transition.

476TimeDiT: General-purpose Diffusion Transformers for Time Series Foundation Model

[openreview] [pdf]

Abstract With recent advances in building foundation models for text and video data, there is a surge of interest in foundation modeling for time series. Many families of models have been developed utilizing a temporal autoregressive Transformer architecture, whose effectiveness has been proven in Large Language Models (LLMs). However, real-world time series exhibit unique challenges, such as variable channel sizes across domains, missing values, and varying signal sampling intervals due to the multi-resolution nature of real-world data. Additionally, the unidirectional nature of temporally autoregressive decoding typically learns a deterministic mapping relationship and limits the incorporation of domain knowledge, such as physical laws. To address these challenges, we introduce the Time Diffusion Transformer (TimeDiT), a general foundation model for time series that jointly leverages the transformer inductive bias to capture temporal dependencies and the diffusion processes to generate high-quality candidate samples. The proposed mask unit for task-agnostic pretraining and task-specific sampling enables direct processing of multivariate inputs even with missing values or multi-resolution. Furthermore, we introduce a theoretically justified finetuning-free model editing strategy that allows the flexible integration of external knowledge during the sampling process. Extensive experiments conducted on a variety of tasks, such as forecasting, imputation, and anomaly detection highlight TimeDiT’s adaptability as a foundation model, addressing diverse time series challenges and advancing analysis in various fields.

477InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation

[openreview] [pdf]

Abstract Data analytics is essential for extracting valuable insights from data that can assist organizations in making effective decisions. We introduce InsightBench, a benchmark dataset with three key features. First, it consists of 100 datasets representing diverse business use cases such as finance and incident management, each accompanied by a carefully curated set of insights planted in the datasets. Second, unlike existing benchmarks focusing on answering single queries, InsightBench evaluates agents based on their ability to perform end-to-end data analytics, including formulating questions, interpreting answers, and generating a summary of insights and actionable steps. Third, we conducted comprehensive quality assurance to ensure that each dataset in the benchmark had clear goals and included relevant and meaningful questions and analysis. Furthermore, we implement a two-way evaluation mechanism using LLaMA-3 as an effective, open-source evaluator to assess agents’ ability to extract insights. We also propose AgentPoirot, our baseline data analysis agent capable of performing end-to-end data analytics. Our evaluation on InsightBench shows that AgentPoirot outperforms existing approaches (such as Pandas Agent) that focus on resolving single queries. We also compare the performance of open- and closed-source LLMs and various evaluation strategies. Overall, this benchmark serves as a testbed to motivate further development in comprehensive automated data analytics.

478Noise Prompt Learning: Learning the Winning Tickets for Diffusion Sampling

[openreview] [pdf]

Abstract Text-to-image diffusion model is a popular paradigm that synthesizes personalized images by providing a text prompt and a random Gaussian noise. While people observe that some noises are winning tickets that can achieve better text-image alignment and higher human preference than others, we still lack a machine learning framework to obtain those winning noises. To learn winning noises for diffusion sampling, we mainly make three contributions in this paper. First, we identify a new concept termed the noise prompt, which aims at turning a random Gaussian noise into a winning noise ticket by adding a small desirable perturbation derived from the text prompt. Following the concept, we formulate the noise prompt learning framework that systematically learns “prompted” winning noise tickets associated with a text prompt for diffusion models. Second, we design a noise prompt data collection pipeline and collect a large-scale noise prompt dataset (NPD) that contains 100k pairs of random noises and winning noises with the associated text prompts. With the prepared NPD as the training dataset, we train a small noise prompt network (NPNet) that can directly learn to transform a random noise ticket into a winning noise ticket. The learned winning noise perturbation can be considered as a kind of prompt for noise, as it is rich in semantic information and tailored to the given text prompt. Third, our extensive experiments demonstrate the impressive effectiveness and generalization of NPNet on improving the quality of synthesized images across various diffusion models, including SDXL, DreamShaper-xl-v2-turbo, and Hunyuan-DiT. Moreover, NPNet is a small and efficient controller that acts as a plug-and-play module with very limited additional inference and computational costs, as it just provides a winning noise instead of a random noise without accessing the original pipeline.
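
A minimal sketch of what such a noise prompt network could look like; the architecture, sizes, and perturbation scale below are assumptions, not the paper's design:

```python
import torch

class NPNet(torch.nn.Module):
    """Map an initial Gaussian noise plus a text embedding to a small
    perturbation, producing a 'winning' noise for the same sampler."""
    def __init__(self, noise_dim, text_dim, hidden=512):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(noise_dim + text_dim, hidden),
            torch.nn.SiLU(),
            torch.nn.Linear(hidden, noise_dim),
        )

    def forward(self, noise, text_emb, scale=0.1):   # perturbation scale: assumed
        flat = noise.flatten(1)                      # (B, noise_dim)
        delta = self.net(torch.cat([flat, text_emb], dim=-1))
        return noise + scale * delta.view_as(noise)
```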

479WARP: On the Benefits of Weight Averaged Rewarded Policies

[openreview] [pdf]

Abstract Reinforcement learning from human feedback (RLHF) aligns large language models by encouraging their generations to have high rewards, using a reward model trained on human preferences. To prevent forgetting of pre-trained knowledge, RLHF usually incorporates a KL regularization; this forces the policy to remain close to its initialization, though it hinders the reward optimization. To address the trade-off between KL and reward, in this paper we introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP), merging policies in the weight space at three distinct stages. First, it uses the exponential moving average of the policy as a dynamic anchor in the KL regularization. Second, it applies spherical interpolation to merge independently fine-tuned policies into a new enhanced one. Third, it linearly interpolates between this merged model and the initialization, to recover features from pre-training. This procedure is then applied iteratively, with each iteration’s final model used as an advanced initialization for the next, progressively refining the KL-reward Pareto front, achieving superior rewards at fixed KL. Experiments with Gemma policies validate that WARP improves their quality and alignment, outperforming open-source models.
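
The three weight-space stages can be sketched directly on state dicts. Here slerp is applied per-tensor to full weights, a simplification (WARP applies it to the fine-tuning updates relative to the shared initialization), and all coefficients are assumptions:

```python
import torch

def ema_update(anchor, policy, mu=0.01):
    """Stage 1: exponential moving average of the policy, used as the dynamic
    anchor in the KL regularization."""
    for a, p in zip(anchor.values(), policy.values()):
        a.mul_(1 - mu).add_(p, alpha=mu)

def slerp(sd_a, sd_b, t=0.5):
    """Stage 2: spherical interpolation of two fine-tuned policies, per tensor."""
    out = {}
    for key in sd_a:
        a, b = sd_a[key].flatten().float(), sd_b[key].flatten().float()
        omega = torch.acos((a @ b / (a.norm() * b.norm() + 1e-12)).clamp(-1, 1))
        so = torch.sin(omega) + 1e-12
        out[key] = ((torch.sin((1 - t) * omega) / so) * a
                    + (torch.sin(t * omega) / so) * b).view_as(sd_a[key])
    return out

def toward_init(sd_init, sd_merged, eta=0.3):
    """Stage 3: linear interpolation between the merged model and the
    initialization, recovering pre-trained features."""
    return {k: (1 - eta) * sd_init[k] + eta * sd_merged[k] for k in sd_init}
```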

480Unifying Back-Propagation and Forward-Forward Algorithms through Model Predictive Control

[openreview] [pdf]

Abstract We introduce a Model Predictive Control (MPC) framework for training deep neural networks, systematically unifying the Back-Propagation (BP) and Forward-Forward (FF) algorithms. At the same time, it gives rise to a range of intermediate training algorithms with varying look-forward horizons, leading to a performance-efficiency trade-off. We perform a precise analysis of this trade-off on a deep linear network, where the qualitative conclusions carry over to general networks. Based on our analysis, we propose a principled method to choose the optimization horizon based on given objectives and model specifications. Numerical results on various models and tasks demonstrate the versatility of our method.

481Minifinetuning: Low-Data Generation Domain Adaptation through Corrective Self-Distillation

[openreview] [pdf]

Abstract Finetuning language models for a new domain inevitably leads to the deterioration of their general performance. This becomes more pronounced the more limited the finetuning data resource. We introduce minifinetuning (MFT), a method for language model domain adaptation that considerably reduces the effects of overfitting-induced degeneralization in low-data settings and does so in the absence of any pre-training data for replay. MFT demonstrates 2-10x more favourable specialization-to-degeneralization ratios than standard finetuning across a wide range of models and domains and exhibits an intrinsic robustness to overfitting when data in the new domain is scarce and down to as little as 500 samples. Employing corrective self-distillation that is individualized on the sample level, MFT outperforms parameter-efficient finetuning methods, demonstrates replay-like forgetting mitigation properties, and is composable with either for a combined effect.
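
A hedged sketch of what a per-sample corrective self-distillation loss could look like: the frozen pre-finetuning model's distribution is nudged toward the ground-truth token and used as a soft target. The fixed correction strength delta is an assumption; MFT individualizes the correction at the sample level:

```python
import torch
import torch.nn.functional as F

def mft_loss(student_logits, teacher_logits, targets, delta=0.2):
    """Cross-entropy of the student against a corrected soft target: the frozen
    pre-finetuning (teacher) distribution moved toward the ground-truth token."""
    with torch.no_grad():
        t = F.softmax(teacher_logits, dim=-1)         # teacher's own distribution
        onehot = F.one_hot(targets, t.shape[-1]).float()
        corrected = (1 - delta) * t + delta * onehot  # convex corrective target
    logp = F.log_softmax(student_logits, dim=-1)
    return -(corrected * logp).sum(-1).mean()
```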

482Investigating Memorization in Video Diffusion Models

[openreview] [pdf]

Abstract Diffusion models, widely used for image and video generation, face a significant limitation: the risk of memorizing and reproducing training data during inference, potentially generating unauthorized copyrighted content. While prior research has focused on image diffusion models (IDMs), video diffusion models (VDMs) remain underexplored. To address this, we introduce new metrics specifically designed to separately assess content and motion memorization in VDMs. By applying these metrics, we systematically analyze memorization in various pretrained VDMs, including text-conditional and unconditional models on various datasets, revealing that memorization is widespread across both video and image datasets. Finally, we propose effective detection strategies for both content and motion memorization, offering a foundational approach for improving privacy in VDMs.

483THE ROBUSTNESS OF DIFFERENTIABLE CAUSAL DISCOVERY IN MISSPECIFIED SCENARIOS

[openreview] [pdf]

Abstract Causal discovery aims to learn causal relationships between variables from targeted data, making it a fundamental task in machine learning. However, causal discovery algorithms often rely on unverifiable causal assumptions, which are usually difficult to satisfy in real-world data, thereby limiting the broad application of causal discovery in practical scenarios. Inspired by these considerations, this work extensively benchmarks the empirical performance of various mainstream causal discovery algorithms, which assume i.i.d. data, under eight model assumption violations. Our experimental results show that differentiable causal discovery methods exhibit counter-intuitive robustness under the metrics of Structural Hamming Distance and Structural Intervention Distance of the inferred graphs in challenging scenarios, except for scale variation. We also provide the theoretical explanations for the performance of differentiable causal discovery methods. Finally, our work aims to comprehensively benchmark the performance of recent differentiable causal discovery methods under model assumption violations, and provide the standard for reasonable evaluation of causal discovery, as well as to further promote its application in real-world scenarios.

484ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization

[openreview] [pdf]

Abstract Reward shaping is a critical component in reinforcement learning (RL), particularly for complex tasks where sparse rewards can hinder learning. While shaping rewards have been introduced to provide additional guidance, selecting effective shaping functions remains challenging and computationally expensive. This paper introduces Online Reward Selection and Policy Optimization (ORSO), a novel approach that frames shaping reward selection as an online model selection problem. ORSO employs principled exploration strategies to automatically identify promising shaping reward functions without human intervention, balancing exploration and exploitation with provable regret guarantees. We demonstrate ORSO’s effectiveness across various continuous control tasks using the Isaac Gym simulator. Compared to traditional methods that fully evaluate each shaping reward function, ORSO significantly improves sample efficiency, reduces computational time, and consistently identifies high-quality reward functions that produce policies comparable to those generated by domain experts through hand-engineered rewards.

485The Hidden Cost of Waiting for Accurate Predictions

[openreview] [pdf]

Abstract Algorithmic predictions are increasingly informing societal resource allocations by identifying individuals for targeting. Policymakers often build these systems with the assumption that by gathering more observations on individuals, they can improve predictive accuracy and, consequently, allocation efficiency. An overlooked yet consequential aspect of prediction-driven allocations is that of timing. The planner has to trade off relying on earlier and potentially noisier predictions to intervene before individuals experience undesirable outcomes against waiting to gather more observations to make more precise allocations. We examine this tension using a simple mathematical model, where the planner collects observations on individuals to improve predictions over time. We analyze both the ranking induced by these predictions and optimal resource allocation. We show that though individual prediction accuracy may improve over time, counter-intuitively, the average ranking loss can worsen. As a result, the planner’s ability to improve social welfare can decline. We identify inequality as a driving factor behind this phenomenon. Our findings provide a nuanced perspective and challenge the conventional wisdom that it is preferable to wait for more accurate predictions to ensure the most efficient allocations.

486Stable batched bandit: Optimal regret with free inference

[openreview] [pdf]

Abstract In this paper, we discuss statistical inference when using a sequential strategy to collect data. While inferential tasks become challenging with sequentially collected data, we argue that this problem can be alleviated when the sequential algorithm satisfies certain stability properties; we call such algorithms stable bandit algorithms. Focusing on batched bandit problems, we first demonstrate that popular algorithms including the greedy-UCB algorithm and ε-greedy ETC algorithms are not stable, complicating downstream inferential tasks. Our main result shows that a form of elimination algorithm is stable in the batched bandit setup, and we characterize the asymptotic distribution of the sample means. This result allows us to construct asymptotically exact confidence intervals for arm-means which are sharper than existing concentration-based bounds. As a byproduct of our main results, we propose an Explore and Commit (ETC) strategy, which is stable --- thus allowing easy statistical inference --- and also attains optimal regret up to a factor of 4. Our work connects two historically conflicting paradigms in sequential learning environments: regret minimization and statistical inference. Ultimately, we demonstrate that it is possible to minimize regret without sacrificing the ease of performing statistical inference, bridging the gap between these two important aspects of sequential decision-making.
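
A minimal sketch of an explore-and-commit strategy with the post-hoc inference it enables; the uniform, non-adaptive exploration phase is what keeps the sample means amenable to standard confidence intervals. The constants (unit variance, 95% level) are assumptions:

```python
import numpy as np

def explore_then_commit(pull, n_arms, n_explore, horizon):
    """Uniformly explore every arm n_explore times (a fixed, non-adaptive
    allocation), commit to the empirical best arm for the rest of the horizon,
    and return CLT-based confidence intervals for the arm means."""
    means = np.array([np.mean([pull(a) for _ in range(n_explore)])
                      for a in range(n_arms)])
    best = int(np.argmax(means))
    for _ in range(horizon - n_arms * n_explore):
        pull(best)                                   # commit phase
    half = 1.96 / np.sqrt(n_explore)                 # 95% CI, unit variance assumed
    return best, [(m - half, m + half) for m in means]

# e.g. explore_then_commit(lambda a: np.random.normal(a * 0.1), 3, 100, 1000)
```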

487Secure Diffusion Model Unlocked: Efficient Inference via Score Distillation

[openreview] [pdf]

Abstract As services based on diffusion models expand across various domains, preserving the privacy of client data becomes more critical. Fully homomorphic encryption and secure multi-party computation have been employed for privacy-preserving inference, but these methods are computationally expensive and primarily work for linear computations, making them challenging to apply to large diffusion models. While homomorphic encryption has been recently applied to diffusion models, it falls short of fully safeguarding privacy, as inputs used in the ε prediction are not encrypted. In this paper, we propose a novel framework for private inference for both inputs and outputs. To ensure robust approximations, we introduce several techniques for handling non-linear operations. Additionally, to reduce latency, we curtail the number of denoising steps while minimizing performance degradation of conditional generation through score distillation from the unconditional generation of the original model with full denoising steps. Experimental results show that our model produces high-quality images comparable to the original, and the proposed score distillation significantly enhances performance, compensating for fewer steps and approximation errors.

488A Simple Approach to Unifying Diffusion-based Conditional Generation

[openreview] [pdf]

Abstract Recent progress in image generation has sparked research into controlling these models through condition signals, with various methods addressing specific challenges in conditional generation. Instead of proposing another specialized technique, we introduce a simple, unified framework to handle diverse conditional generation tasks involving a specific image-condition correlation. By learning a joint distribution over a correlated image pair (e.g. image and depth) with a diffusion model, our approach enables versatile capabilities via different inference-time sampling schemes, including controllable image generation (e.g. depth to image), estimation (e.g. image to depth), signal guidance, joint generation (image & depth), and coarse control. Previous attempts at unification often introduce complexity through multi-stage training, architectural modification, or increased parameter counts. In contrast, our simplified formulation requires a single, computationally efficient training stage, maintains the standard model input, and adds minimal learned parameters (15% of the base model). Moreover, our model supports additional capabilities like non-spatially aligned and coarse conditioning. Extensive results show that our single model can produce comparable results with specialized methods and better results than prior unified methods. We also demonstrate that multiple models can be effectively combined for multi-signal conditional generation.

489Retrieval Augmented Time Series Forecasting

[openreview] [pdf]

Abstract Time series forecasting uses historical data to predict future trends, leveraging the relationships between past observations and available features. In this paper, we propose RAFT, a retrieval-augmented time series forecasting method to provide sufficient inductive biases and complement the model’s learning capacity. When forecasting the subsequent time frames, we directly retrieve historical data candidates from the training dataset with patterns most similar to the input, and utilize the future values of these candidates alongside the inputs to obtain predictions. This simple approach augments the model’s capacity by externally providing information about past patterns via retrieval modules. Our empirical evaluations on eight benchmark datasets show that RAFT consistently outperforms contemporary baselines, with an average win ratio of 86% for multivariate forecasting and 80% for univariate forecasting tasks.
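
A minimal sketch may make the retrieve-then-blend idea concrete: find the training windows whose pattern is closest to the current input and use their continuations alongside the model's own prediction. The function name, the Euclidean distance, and the blending weight below are illustrative assumptions, not details from the paper.

```python
import numpy as np

def raft_style_forecast(history, train_series, window, horizon,
                        k=3, alpha=0.5, model_pred=None):
    """Sketch of retrieval-augmented forecasting in the spirit of RAFT:
    retrieve the k most similar training windows and average their
    continuations, optionally blending with a base model's prediction."""
    query = history[-window:]
    candidates = []
    for t in range(len(train_series) - window - horizon):
        past = train_series[t:t + window]
        future = train_series[t + window:t + window + horizon]
        candidates.append((np.linalg.norm(past - query), future))
    candidates.sort(key=lambda c: c[0])                  # most similar first
    retrieved = np.mean([f for _, f in candidates[:k]], axis=0)
    if model_pred is None:
        return retrieved
    return alpha * model_pred + (1 - alpha) * retrieved  # simple blend

# usage on a synthetic sine series
series = np.sin(np.linspace(0, 20 * np.pi, 2000))
print(raft_style_forecast(series[:1500], series[:1400], window=48, horizon=12))
```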

490Subject Information Extraction for Novelty Detection with Domain Shifts

[openreview] [pdf]

Abstract Unsupervised novelty detection (UND), aimed at identifying novel samples, is essential in fields like medical diagnosis, cybersecurity, and industrial quality control. Most existing UND methods assume that the training data and testing normal data originate from the same domain and only consider the distribution variation between training data and testing data. However, in real scenarios, it is common for normal testing and training data to originate from different domains, a challenge known as domain shift. The discrepancies between training and testing data often lead to incorrect classification of normal data as novel by existing methods. A typical situation is that testing normal data and training data describe the same subject, yet they differ in the background conditions. To address this problem, we introduce a novel method that separates subject information from the background variation encapsulating the domain information to enhance detection performance under domain shifts. The proposed method minimizes the mutual information between the representations of the subject and background while modelling the background variation using a deep Gaussian mixture model; novelty detection is then conducted solely on the subject representations and hence is not affected by the variation of domains. Extensive experiments demonstrate that our model generalizes effectively to unseen domains and significantly outperforms baseline methods, especially under substantial domain shifts between training and testing data.

491Beyond Imitation: Learning Key Reasoning Steps from Dual Chain-of-Thoughts in Reasoning Distillation

[openreview] [pdf]

Abstract As Large Language Models (LLMs) scale up and gain powerful Chain-of-Thoughts (CoTs) reasoning abilities, practical resource constraints drive efforts to distill these capabilities into more compact Smaller Language Models (SLMs). We find that CoTs consist mainly of simple reasoning forms, with a small proportion (~4.7%) of key reasoning steps that truly impact conclusions. However, previous distillation methods typically involve supervised fine-tuning student SLMs only on correct CoTs data produced by teacher LLMs, resulting in students struggling to learn the key reasoning steps, instead imitating the teacher’s reasoning forms and making errors or omissions on these steps. To address these issues, drawing an analogy to human learning, where analyzing mistakes according to correct solutions often reveals the crucial steps leading to successes or failures, we propose mistakE-Driven key reasonIng step distillaTion (EDIT), a novel method that helps SLMs learn key reasoning steps rather than relying on simple fine-tuning alone. Firstly, to expose these crucial steps in CoTs, we design specific prompts to generate dual CoTs data with similar reasoning paths but divergent conclusions. Then, we apply the minimum edit distance algorithm on the dual CoTs data to locate these key steps and optimize the likelihood of these steps. Extensive experiments validate the effectiveness of EDIT across both in-domain and out-of-domain benchmark reasoning datasets. Further analysis shows that EDIT can generate high-quality CoTs with more correct key reasoning steps. Notably, we also explore how different mistake patterns affect performance and find that EDIT benefits more from logical errors than from knowledge or mathematical calculation errors in dual CoTs. Code can be found at https://anonymous.4open.science/r/eb77sh-F564
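
The mechanism of locating key steps from dual CoTs can be illustrated with a small alignment sketch. The snippet below uses difflib's sequence alignment as a stand-in for the paper's minimum edit distance algorithm; the step granularity and all names are illustrative assumptions.

```python
import difflib

def key_step_spans(correct_steps, mistaken_steps):
    """Sketch of locating key reasoning steps, loosely following EDIT's idea:
    steps where a correct CoT and a mistaken CoT (similar path, divergent
    conclusion) disagree are treated as the key steps whose likelihood the
    distillation objective should upweight."""
    sm = difflib.SequenceMatcher(a=mistaken_steps, b=correct_steps)
    key = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op in ("replace", "insert"):     # present only in the correct CoT
            key.extend(correct_steps[j1:j2])
    return key

correct = ["let x=3", "so 2x=6", "add 4: 2x+4=10", "answer 10"]
wrong   = ["let x=3", "so 2x=6", "add 4: 2x+4=12", "answer 12"]
print(key_step_spans(correct, wrong))  # -> the two divergent steps
```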

492Learning Time-shared Hidden Heterogeneity for Counterfactual Outcome Forecast

[openreview] [pdf]

Abstract Forecasting counterfactual outcomes in the longitudinal setting can be critical for many time-related applications. To solve this problem, previous works apply different sequence models, including long short-term memory (LSTM) networks and transformers, to model the relationship between the observed histories, treatments and outcomes, and apply various approaches to remove treatment selection bias. However, these methods neglect the hidden heterogeneity of outcome generation among samples induced by hidden factors, which can hinder counterfactual outcome forecasting. To alleviate this problem, we capture the hidden heterogeneity by recovering the hidden factors and incorporate it into the outcome prediction process. Specifically, we propose a Time-shared Heterogeneity Learning from Time Series (THLTS) method which infers the shared part of hidden factors characterizing the heterogeneity across time steps with the architecture of variational autoencoders (VAEs). This method can be a flexible component and combined with arbitrary counterfactual outcome forecast methods. Experimental results on (semi-)synthetic datasets demonstrate that, combined with our method, mainstream models can improve their performance.

493Understanding Distribution Alignment Through Category Separability In An Infant-Inspired Domain Adaptation Task

[openreview] [pdf]

Abstract We introduce a novel distribution shift considering the tradeoff between object instances and viewpoints occurring in human and embodied visual experience; we study this problem through the lens of domain adaptation. We show that the performance of a well-known domain adaptation method, Joint Adaptation Network (JAN), deteriorates in the absence of ImageNet pretraining. We hypothesize that the separability of source and target category clusters in the feature space plays a crucial role in the effectiveness of JAN. To this end, we propose three metrics to measure category separability in the feature space and show that separability in the pretrained network is strongly correlated with downstream JAN accuracy. Further, we propose two novel loss functions that increase target separability by aligning the distribution of within-domain pairwise distances between the source and target clusters. Our experiments show that the application of these loss functions improves downstream performance on the test set.

494DisCoNet: Rethinking Adversarial Networks for Discriminator-Driven Distribution Modeling

[openreview] [pdf]

Abstract Out-of-distribution (OOD) detection holds significant importance across various applications. While semantic and domain-shift OOD problems are well-documented, this work focuses on the nuances of covariate shifts, which entail subtle perturbations or variations in the data distribution. These disturbances have proven to negatively impact machine learning performance. We have found that existing OOD detection methods often struggle to effectively distinguish covariate shifts from in-distribution instances, emphasizing the need for specialized solutions. Therefore, we propose DisCoNet, an Adversarial Variational Autoencoder (VAE) that rethinks the Generative Adversarial Networks paradigm. Instead of prioritizing the generator as the network’s core, we focus on the discriminator, using the generator as a supporting training tool. DisCoNet uses the VAE’s suboptimal outputs as negative samples to train the discriminator, thereby improving its ability to delineate the boundary between in-distribution samples and covariate shifts. By tightening this in-distribution boundary, DisCoNet achieves state-of-the-art results in public OOD detection benchmarks. The proposed model not only excels in detecting covariate shifts, achieving 98.9% AUROC on ImageNet-1K(-C), but also outperforms all prior methods on public semantic OOD benchmarks. With a model size of 25MB, it is highly effective on Far-OOD (OpenImage-O (99.4%) and iNaturalist (100.0%)) and Near-OOD (SSB-hard (99.9%) and NINCO (99.7%)) detection. The code will be made publicly available.

495Unveiling the Secret of AdaLN-Zero in Diffusion Transformer

[openreview] [pdf]

Abstract Diffusion transformer (DiT), a rapidly emerging architecture for image generation, has gained much attention. However, despite ongoing efforts to improve its performance, the understanding of DiT remains superficial. In this work, we delve into and investigate a critical conditioning mechanism within DiT, adaLN-Zero, which achieves superior performance compared to adaLN. Our work studies three potential elements driving this performance, including an SE-like structure, zero-initialization, and a “gradual” update order, among which zero-initialization proves to be the most influential. Building on this insight, we heuristically leverage Gaussian distributions to initialize each condition modulation, termed adaLN-Gaussian, leading to more stable and effective training. Extensive experiments following DiT on ImageNet1K demonstrate the effectiveness and generalization of adaLN-Gaussian, e.g., a notable improvement of 2.16% in FID score over adaLN-Zero.
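
For readers unfamiliar with the conditioning mechanism under discussion, here is a minimal PyTorch sketch of a DiT-style modulation branch with the two initializations the abstract compares. The layer layout and the Gaussian standard deviation are assumptions; the paper's exact adaLN-Gaussian parameterization may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaLNModulation(nn.Module):
    """Sketch of a DiT-style conditioning branch producing scale/shift/gate.
    adaLN-Zero initializes the projection to zeros; adaLN-Gaussian (per the
    abstract) draws initial weights from a Gaussian instead. The std below
    is an illustrative guess, not the paper's value."""
    def __init__(self, cond_dim, hidden_dim, init="gaussian", std=1e-2):
        super().__init__()
        self.proj = nn.Linear(cond_dim, 3 * hidden_dim)  # scale, shift, gate
        if init == "zero":
            nn.init.zeros_(self.proj.weight)             # adaLN-Zero
        else:
            nn.init.normal_(self.proj.weight, mean=0.0, std=std)  # adaLN-Gaussian
        nn.init.zeros_(self.proj.bias)

    def forward(self, cond):
        scale, shift, gate = self.proj(F.silu(cond)).chunk(3, dim=-1)
        return scale, shift, gate

mod = AdaLNModulation(cond_dim=256, hidden_dim=256)
s, b, g = mod(torch.randn(4, 256))   # per-sample modulation parameters
```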

496Safety Alignment Shouldn’t Be Complicated

[openreview] [pdf]

Abstract As large language models (LLMs) are increasingly integrated into various applications, ensuring they generate safe and aligned responses is a pressing need. Previous research on alignment has largely focused on general instruction-following but has often overlooked the unique properties and challenges of safety alignment, such as the brittleness of safety mechanisms. To bridge the gap, we propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment should teach an otherwise unsafe model to choose the correct reasoning direction - interpreted as a specialized binary classification task - and incorporate a refusal mechanism with multiple reserved fallback options. Furthermore, through SSAH, we hypothesize that safety guardrails in LLMs can be established by just a small number of essential components. To verify this, we conduct an ablation study and successfully identify four types of attribute-critical components in safety-aligned LLMs: Exclusive Safety Unit (ESU), Exclusive Utility Unit (EUU), Complex Unit (CU), and Redundant Unit (RU). Our findings show that freezing certain safety-critical components (7.5%) during fine-tuning allows the model to retain its safety attributes while adapting to new tasks. Additionally, we show that leveraging redundant units (20%) in the pre-trained model as an “alignment budget” can effectively minimize the alignment tax while achieving the alignment goal. Taken together, this paper concludes that the atomic functional unit for safety in LLMs is at the neuron level and underscores that safety alignment should not be complicated. We believe this work contributes to the foundation of efficient and scalable safety alignment for future LLMs.

497Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner

[openreview] [pdf]

Abstract We present an approach called Dialogue Action Tokens (DAT) that adapts language model agents to plan goal-directed dialogues. The core idea is to treat each utterance as an action, thereby converting dialogues into games where existing approaches such as reinforcement learning can be applied. Specifically, we freeze a pretrained language model and train a small planner model that predicts a continuous action vector, used for controlled generation in each round. This design avoids the problem of language degradation under reward optimization. When evaluated on the Sotopia platform for social simulations, the DAT-steered LLaMA model surpasses GPT-4’s performance. We also apply DAT to steer an attacker language model in a novel multi-turn red-teaming setting, revealing a potential new attack surface.

498Never Forget the Basics: In-distribution Knowledge Retention for Continual Test-time Adaptation in Human Motion Prediction

[openreview] [pdf]

Abstract This paper presents a novel approach to addressing the underexplored challenge of human pose prediction in dynamic target domains that simultaneously contain in-distribution (ID) and out-of-distribution (OOD) data. Existing test-time adaptation (TTA) techniques predominantly focus on OOD data, neglecting the fact that ID data, which closely resembles the training distribution, is often encountered during real-world deployment, leading to significant degradation in ID performance. To address this, we introduce In-Distribution Knowledge Retention (IDKR), a continual TTA framework designed to preserve critical knowledge about ID data while adapting to unseen OOD sequences. Our method introduces an ID-informative subgraph learning strategy that leverages the structural characteristics of human skeletal data to compute a structural graph Fisher Information Matrix (SG-FIM). Unlike prior work, IDKR simultaneously considers both node and edge features in the skeletal graph, with edge features, representing the invariant bone lengths between parent-child joint pairs, being essential for maintaining structural consistency across poses. These edge features are key to extracting reliable SG-FIM parameters, enabling the model to retain parameters critical for ID performance while selectively updating those needed for OOD adaptation. Extensive experiments on multiple benchmark datasets demonstrate that IDKR consistently outperforms state-of-the-art methods, particularly in scenarios involving mixed ID and OOD data, setting a new standard for robust human pose prediction in dynamic environments.

499Flexible Active Learning of PDE Trajectories

[openreview] [pdf]

Abstract Accurately solving partial differential equations (PDEs) is critical for understanding complex scientific and engineering phenomena, yet traditional numerical solvers are computationally expensive. Surrogate models offer a more efficient alternative, but their development is hindered by the cost of generating sufficient ground-truth data from numerical simulations. In this paper, we present a novel framework for active learning (AL) in PDE surrogate modeling that reduces the data acquisition cost and improves model accuracy. Unlike the existing AL methods for PDEs that always acquire entire PDE trajectories, our approach strategically queries only a subset of the time steps from a numerical solver along a trajectory, while employing a surrogate model to approximate values for the remaining steps. This dramatically reduces the cost of data acquisition, which is proportional to the number of time steps simulated by the numerical solver, and thus allows the active learning algorithm to try out a more diverse set of trajectories given the same computational budget. To accommodate this novel framework, we develop an acquisition function that estimates the utility of a set of time steps by approximating its resulting variance reduction. We demonstrate the effectiveness of our method on several benchmark PDEs, including the Heat equation, Korteweg–De Vries equation, Kuramoto–Sivashinsky equation, and the incompressible Navier-Stokes equation. Extensive experiments validate that our approach outperforms existing methods, offering a cost-efficient solution to surrogate modeling for PDEs.

500Learning a Fast Mixing Exogenous Block MDP using a Single Trajectory

[openreview] [pdf]

Abstract In order to train agents that can quickly adapt to new objectives or reward functions, efficient unsupervised representation learning in sequential decision-making environments can be important. Frameworks such as the Exogenous Block Markov Decision Process (Ex-BMDP) have been proposed to formalize this representation-learning problem (Efroni et al., 2022b). In the Ex-BMDP framework, the agent’s high-dimensional observations of the environment have two latent factors: a controllable factor, which evolves deterministically within a small state space according to the agent’s actions, and an exogenous factor, which represents time-correlated noise, and can be highly complex. The goal of the representation learning problem is to learn an encoder that maps from observations into the controllable latent space, as well as the dynamics of this space. Efroni et al. (2022b) has shown that this is possible with a sample complexity that depends only on the size of the controllable latent space, and not on the size of the noise factor. However, this prior work has focused on the episodic setting, where the controllable latent state resets to a specific start state after a finite horizon. By contrast, if the agent can only interact with the environment in a single continuous trajectory, prior works have not established sample-complexity bounds. We propose STEEL, the first provably sample-efficient algorithm for learning the controllable dynamics of an Ex-BMDP from a single trajectory, in the function approximation setting. STEEL has a sample complexity that depends only on the sizes of the controllable latent space and the encoder function class, and (at worst linearly) on the mixing time of the exogenous noise factor. We prove that STEEL is correct and sample-efficient, and demonstrate STEEL on two toy problems.

501Diffusion Process with Implicit Latents via Energy Models

[openreview] [pdf]

Abstract We present a generative model based on an ordered sequence of latent variables for intermediate distributions between a given source and a desired target distribution. We construct the probabilistic transitions among the latent variables using energy models that are in the form of classifiers. In our work, the intermediate transitional distributions are implicitly defined by the energy models during training, where the statistical properties of the data distribution are naturally taken into account. This is in contrast to denoising diffusion probabilistic models (DDPMs), where they are explicitly defined by the predefined scheduling of a sequential noise degradation process. Over the course of training, our model is designed to optimally determine the intermediate distributions by Langevin dynamics driven by the energy model. Energy-based models (EBMs), by comparison, typically require an additional generator since the intermediate distributions are not explicitly defined in their training procedure. We demonstrate the effectiveness and efficiency of the proposed algorithm in the context of image generation, achieving high-fidelity results with fewer inference steps on a variety of datasets.
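
Since the abstract hinges on Langevin dynamics driven by an energy model, a bare-bones update step may help. The sketch below shows generic energy-driven Langevin sampling; the paper's classifier-based energies and step schedule are replaced here by an arbitrary differentiable energy and fixed constants.

```python
import torch

def langevin_step(x, energy_fn, step_size=0.01, noise_scale=0.005):
    """One Langevin update driven by a learned energy (generic sketch):
    move samples downhill in energy with injected Gaussian noise."""
    x = x.detach().requires_grad_(True)
    energy = energy_fn(x).sum()
    grad, = torch.autograd.grad(energy, x)
    with torch.no_grad():
        x = x - step_size * grad + noise_scale * torch.randn_like(x)
    return x

# usage: a quadratic toy energy pulls samples toward the origin
x = torch.randn(16, 2)
for _ in range(100):
    x = langevin_step(x, lambda z: 0.5 * (z ** 2).sum(dim=-1))
print(x.norm(dim=-1).mean())   # shrinks toward 0 as sampling proceeds
```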

502SEMDICE: Off-policy State Entropy Maximization via Stationary Distribution Correction Estimation

[openreview] [pdf]

Abstract In the unsupervised pre-training for reinforcement learning, the agent aims to learn a prior policy for downstream tasks without relying on task-specific reward functions. We focus on state entropy maximization (SEM), where the goal is to learn a policy that maximizes the entropy of the state’s stationary distribution. In this paper, we introduce SEMDICE, a principled off-policy algorithm that computes a single, stationary Markov state-entropy-maximizing policy from an arbitrary off-policy dataset by optimizing directly within the space of stationary distributions. Experimental results demonstrate that SEMDICE outperforms baseline algorithms in maximizing state entropy while achieving the best adaptation efficiency for downstream tasks among SEM-based unsupervised RL pre-training methods.

503The Crucial Role of Samplers in Online Direct Preference Optimization

[openreview] [pdf]

Abstract Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient solution for language model alignment. Despite its empirical success, the optimization properties, particularly the impact of samplers on its convergence rates, remain underexplored. In this paper, we provide a rigorous analysis of DPO’s convergence rates with different sampling strategies under the exact gradient setting, revealing a surprising separation: uniform sampling achieves linear convergence, while our proposed online sampler achieves quadratic convergence. We further adapt the sampler to practical settings by incorporating posterior distributions and logit mixing, demonstrating significant improvements over previous approaches. On the Safe-RLHF dataset, our method exhibits a 4.5% improvement over vanilla DPO and a 3.0% improvement over on-policy DPO; on Iterative-Prompt, our approach outperforms vanilla DPO, on-policy DPO, and Hybrid GSHF by over 4.2%. Our results not only offer insights into the theoretical standing of DPO but also pave the way for potential algorithm designs in the future.

504From discrete-time policies to continuous-time diffusion samplers: Asymptotic equivalences and faster training

[openreview] [pdf]

Abstract We study the problem of training neural stochastic differential equations, or diffusion models, to sample from a Boltzmann distribution without access to target samples. Existing methods for training such models enforce time-reversal of the generative and noising processes, using either differentiable simulation or off-policy reinforcement learning (RL). We prove equivalences between families of objectives in the limit of infinitesimal discretization steps, linking entropic RL methods (GFlowNets) with continuous-time objects (partial differential equations and path space measures). We further show that an appropriate choice of coarse time discretization during training allows greatly improved sample efficiency and the use of time-local objectives, achieving competitive performance on standard sampling benchmarks with reduced computational cost.

505SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety

[openreview] [pdf]

Abstract As large language models (LLMs) continue to advance and find applications across a growing number of fields, ensuring the safety of LLMs has become increasingly critical. To address safety concerns, recent studies have proposed integrating safety constraints into reinforcement learning from human feedback (RLHF). However, these approaches tend to be complex and often unstable, as they encompass complicated procedures in RLHF along with additional procedures required by the safety constraints. Inspired by direct preference optimization (DPO), we introduce a new algorithm called SafeDPO, which is designed to implicitly optimize the safety alignment objective within a single stage of policy learning. The resulting algorithm can be implemented by introducing only one additional hyperparameter, which aims to further enhance safety, along with minor modifications to the DPO implementation. Consequently, SafeDPO successfully eliminates the necessity of fitting a reward and a cost model, as well as sampling from the language model during fine-tuning, while still enhancing the safety of LLMs. Finally, we demonstrate that SafeDPO achieves competitive performance compared to the current state-of-the-art safety alignment algorithm, both in terms of aligning with human preferences and improving safety.

506Towards Understanding the Universality of Transformers for Next-Token Prediction

[openreview] [pdf]

Abstract Causal Transformers are trained to predict the next token for a given context. While it is widely accepted that self-attention is crucial for encoding the causal structure of sequences, the precise underlying mechanism behind this in-context autoregressive learning ability remains unclear. In this paper, we take a step towards understanding this phenomenon by studying the approximation ability of Transformers for next-token prediction. Specifically, we explore the capacity of causal Transformers to predict the next token $x_{t+1}$ given an autoregressive sequence $(x_1, \dots, x_t)$ as a prompt, where $x_{t+1} = f(x_t)$, and $f$ is a context-dependent function that varies with each sequence. On the theoretical side, we focus on specific instances, namely when $f$ is linear or when $(x_t)$ is periodic. We explicitly construct a Transformer (with linear, exponential, or softmax attention) that learns the mapping $f$ in-context through a causal kernel descent method. The causal kernel descent method we propose provably estimates $x_{t+1}$ based solely on past and current observations $(x_1, \dots, x_t)$, with connections to the Kaczmarz algorithm in Hilbert spaces. We present experimental results that validate our theoretical findings and suggest their applicability to more general mappings $f$.

507Exploration in the Face of Strategic Responses: Provable Learning of Online Stackelberg Games

[openreview] [pdf]

Abstract We study online leader-follower games where the leader interacts with a myopic follower using a quantal response policy. The leader’s objective is to design an algorithm without prior knowledge of her reward function or the state transition dynamics. Crucially, the leader also lacks insight into the follower’s reward function and realized rewards, posing a significant challenge. To address this, the leader must learn the follower’s quantal response mapping solely through strategic interactions --- announcing policies and observing responses. We introduce a unified algorithm, Planning after Estimation, which updates the leader’s policies in a two-step approach. In particular, we first jointly estimate the leader’s value function and the follower’s response mapping by maximizing a sum of the Bellman error of the value function, the likelihood of the quantal response model, and a regularization term that encourages exploration. The leader’s policy is then updated through a greedy planning step based on these estimates. Our algorithm achieves a $\sqrt{T}$-regret in the context of general function approximation. Moreover, this algorithm avoids intractable optimistic planning and thus enhances implementation simplicity.

508Cross-Domain Off-Policy Evaluation and Learning for Contextual Bandits

[openreview] [pdf]

Abstract Off-Policy Evaluation and Learning (OPE/L) in contextual bandits is rapidly gaining popularity in real systems because new policies can be evaluated and learned securely using only historical logged data. However, existing methods in OPE/L cannot handle many challenging but prevalent scenarios such as few-shot data, deterministic logging policies, and new actions. In many applications, such as personalized medicine, content recommendations, education, and advertising, we need to evaluate and learn new policies in the presence of these challenges. Existing methods cannot evaluate and optimize effectively in these situations due to the notorious variance issue or limited exploration in the logged data. To enable OPE/L even under these unsolved challenges, we propose a new problem setup of Cross-Domain OPE/L, where we have access not only to the logged data from the target domain in which the new policy will be implemented but also to logged datasets collected from other domains. This novel formulation is widely applicable because we can often use historical data not only from the target hospital, country, device, or user segment but also from other hospitals, countries, devices, or segments. We develop a new estimator and policy gradient method to solve OPE/L by leveraging both target and source datasets, resulting in substantially enhanced OPE/L in the previously unsolved situations in our empirical evaluations.

509Graph Concept Bottleneck Models

[openreview] [pdf]

Abstract Concept Bottleneck Models (CBMs) provide explicit interpretations for deep neural networks through concepts and allow intervention with concepts to adjust final predictions. Existing CBMs assume concepts are conditionally independent given labels and isolated from each other, ignoring the hidden relationships among concepts. However, the set of concepts in CBMs often has an intrinsic structure where concepts are generally correlated: changing one concept will inherently impact its related concepts. To mitigate this limitation, we propose Graph CBMs: a new variant of CBM that facilitates concept relationships by constructing latent concept graphs, which can be combined with CBMs to enhance model performance while retaining their interpretability. Empirical results on real-world image classification tasks demonstrate Graph CBMs are (1) superior in image classification tasks while providing more concept structure information for interpretability; (2) able to utilize concept graphs for more effective interventions; and (3) robust across different training and architecture settings.

510Series-to-Series Diffusion Bridge Model

[openreview] [pdf]

Abstract Diffusion models have risen to prominence in time series forecasting, showcasing their robust capability to model complex data distributions. However, their effectiveness in deterministic predictions is often constrained by instability arising from their inherent stochasticity. In this paper, we revisit time series diffusion models and present a comprehensive framework that encompasses most existing diffusion-based methods. Building on this theoretical foundation, we propose a novel diffusion-based time series forecasting model, the Series-to-Series Diffusion Bridge Model ($\mathrm{S^2DBM}$), which leverages the Brownian Bridge process to reduce randomness in reverse estimations and improves accuracy by incorporating informative priors and conditions derived from historical time series data. Experimental results demonstrate that $\mathrm{S^2DBM}$ delivers superior performance in point-to-point forecasting and competes effectively with other diffusion-based models in probabilistic forecasting.
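
The variance-reduction argument rests on a basic property of the Brownian bridge: it is pinned at both endpoints, so its variance vanishes exactly where the prior and the target series live. A short NumPy sketch of such a pinned path follows; the names and the sigma value are illustrative, not taken from the paper.

```python
import numpy as np

def brownian_bridge_path(x0, x1, n_steps, sigma=0.1, seed=0):
    """Sample a Brownian bridge between a history-derived prior x0 and a
    target series x1. The standard deviation sigma * sqrt(t * (1 - t))
    vanishes at t = 0 and t = 1, which is what pins the process and
    reduces randomness in reverse estimates."""
    rng = np.random.default_rng(seed)
    path = []
    for t in np.linspace(0.0, 1.0, n_steps):
        mean = (1 - t) * x0 + t * x1
        std = sigma * np.sqrt(t * (1 - t))
        path.append(mean + std * rng.standard_normal(x0.shape))
    return np.stack(path)

path = brownian_bridge_path(np.zeros(24), np.ones(24), n_steps=50)
print(path[0].round(3), path[-1].round(3))  # exactly pinned at both ends
```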

511Flow Tree: A dynamic model for navigation paths and strategies

[openreview] [pdf]

Abstract Navigation is a dynamic process that involves learning how to represent the environment, along with positions in and trajectories through it. Spatial navigation skills vary significantly among individual humans. But what exactly differentiates a good navigator from a bad one, or an easy-to-navigate path from a hard one, is not well understood. Several studies have analysed exploration and navigation behaviour using static quantitative measures, like counts of positions visited or distance travelled. These static measures, however, are inherently limited in their ability to describe dynamic behaviors, providing a coarse quantification of the navigation process. To fill this gap, we introduce the Flow Tree, a novel data structure, which quantifies the dynamics of a group of trajectories through time. This is a discrete adaptation of the Reeb graph, a mathematical structure from topology, computed from multiple trajectories (from different people or the same person over time). Each divergence in trajectory is captured as a node, encoding the variability of the collection of trajectories. A Flow Tree encodes how difficult it will be to navigate a certain path for a group of humans. We apply the Flow Tree to a behavioural dataset of 100 humans exploring and then navigating a small, closed-form maze in virtual reality. In this paper we (1) describe what a Flow Tree is and how to calculate it, (2) show that Flow Trees can be used to predict path difficulty more effectively than static metrics, and (3) demonstrate that a trajectory through the Flow Tree is predictive of that individual’s success. We (4) introduce a hypothesis testing framework over Flow Trees to quantitatively differentiate the strategies of the best navigators from those of the worst. Thus, we show that Flow Trees are a powerful tool to analyse dynamic trajectory data. The code will be made publicly available at [anon-github-link].
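
As a rough intuition for the data structure, one can approximate a Flow Tree by a prefix tree over discretized trajectories, where any node with multiple children marks a divergence point. The sketch below deliberately simplifies away the Reeb-graph machinery the paper builds on; all names are mine.

```python
def build_flow_tree(trajectories):
    """Build a prefix tree over discretized trajectories; node counts
    record how many trajectories pass through each position."""
    tree = {"pos": None, "count": 0, "children": {}}
    for traj in trajectories:
        node = tree
        node["count"] += 1
        for pos in traj:
            node = node["children"].setdefault(
                pos, {"pos": pos, "count": 0, "children": {}})
            node["count"] += 1
    return tree

def divergence_nodes(node, path=()):
    """Yield the points where the group of trajectories splits,
    with the traffic going down each branch."""
    if len(node["children"]) > 1:
        yield path, {p: c["count"] for p, c in node["children"].items()}
    for pos, child in node["children"].items():
        yield from divergence_nodes(child, path + (pos,))

trajs = [("A", "B", "C"), ("A", "B", "D"), ("A", "B", "C")]
print(list(divergence_nodes(build_flow_tree(trajs))))
# -> [(('A', 'B'), {'C': 2, 'D': 1})]: the group diverges after A, B
```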

512Oracle efficient truncated statistics

[openreview] [pdf]

Abstract We study the problem of learning from truncated samples: instead of observing samples from some underlying population $p^\ast$, we observe only the examples that fall in some survival set $S \subset \mathbb{R}^d$ whose probability mass (measured with respect to $p^\ast$) is at least $\alpha$. Assuming membership oracle access to the truncation set $S$, prior works obtained algorithms for the case where $p^\ast$ is Gaussian or more generally an exponential family with strongly convex likelihood --- albeit with a super-polynomial dependency on the (inverse) survival mass $1/\alpha$ both in terms of runtime and in number of oracle calls to the set $S$. In this work we design a new learning method with runtime and query complexity polynomial in $1/\alpha$. Our result significantly improves over the prior works by focusing on efficiently solving the underlying optimization problem using a general purpose optimization algorithm with minimal assumptions.

513Model Collapse in the Chain of Diffusion Finetuning: A Novel Perspective from Quantitative Trait Modeling

[openreview] [pdf]

Abstract The success of generative models has reached a unique threshold where their outputs are indistinguishable from real data, leading to the inevitable contamination of future data collection pipelines with synthetic data. While their potential to generate infinite samples initially offers promise for reducing data collection costs and addressing challenges in data-scarce fields, the severe degradation in performance has been observed when iterative loops of training and generation occur---known as “model collapse.” This paper explores a practical scenario in which a pretrained text-to-image diffusion model is finetuned using synthetic images generated from a previous iteration, a process we refer to as the “Chain of Diffusion.” We first demonstrate the significant degradation in image qualities caused by this iterative process and identify the key factor driving this decline through rigorous empirical investigations. Drawing on an analogy between the Chain of Diffusion and biological evolution, we then introduce a novel theoretical analysis based on quantitative trait modeling. Our theoretical analysis aligns with empirical observations of the generated images in the Chain of Diffusion. Finally, we propose Reusable Diffusion Finetuning (ReDiFine), a simple yet effective strategy inspired by genetic mutations. ReDiFine mitigates model collapse without requiring any hyperparameter tuning, making it a plug-and-play solution for reusable image generation.

514Distilling Reinforcement Learning Algorithms for In-Context Model-Based Planning

[openreview] [pdf]

Abstract Recent studies have demonstrated that Transformers can perform in-context reinforcement learning (RL) by imitating a source RL algorithm. This enables them to adapt to new tasks in a sample-efficient manner without parameter updates. However, since the Transformers are trained to mimic the source algorithm, they also reproduce its suboptimal behaviors. Model-based planning offers a promising solution to this limitation by allowing the agents to simulate potential outcomes before taking action, providing an additional mechanism to deviate from the source algorithm’s behavior. Rather than learning a separate dynamics model, we propose Distillation for In-Context Planning (DICP), an in-context model-based RL framework where the Transformer simultaneously learns environment dynamics and improves policy in-context. With experiments across a diverse set of discrete and continuous environments such as Darkroom variants and Meta-World, we show that this method achieves state-of-the-art performance, requiring significantly fewer environmental interactions than the baselines including both in-context model-free counterparts and existing meta-RL methods.

515Exploring Complex Trade-offs in Information Bottleneck through Multi-Objective Optimization

[openreview] [pdf]

Abstract Information Bottleneck (IB) theory provides a principled approach to analyze and optimize how neural networks extract and learn latent representations from data, aiming to enhance network performance and generalization. The IB framework has been applied and validated across various domains in deep learning. However, most studies employing IB require tuning of Lagrange multipliers to balance compression and prediction during optimization. Finding the optimal Lagrange multiplier β to achieve the best balance between compression and prediction is challenging, relying heavily on empirical tuning and potentially failing to capture the complex trade-offs present within the IB paradigm. In this paper, we redefine the IB problem as a multi-objective optimization problem with respect to compression and prediction objectives. We employ a gradient-based multi-objective optimization algorithm that adaptively determines the weights for this optimization challenge. Our method is demonstrated to automatically find Pareto-optimal solutions, achieving a balance between compression and prediction, and exploring more complex Pareto frontiers than linear weighting. We compare our approach with the Variational Information Bottleneck and its variants across different datasets. Empirical results confirm that our method achieves a more stable and optimal trade-off compared to Information Bottleneck approaches with manually-tuned multipliers. The code is available at https://anonymous.4open.science/r/ASDGASDG.
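
One standard gradient-based multi-objective step of the kind the abstract invokes is the two-task min-norm (MGDA-style) combination, which has a closed form. The sketch below is a generic illustration of adaptive weighting between a compression gradient and a prediction gradient, not the paper's specific algorithm; all names are mine.

```python
import numpy as np

def min_norm_weight(g_comp, g_pred):
    """Closed-form min-norm weight for two objectives: minimize
    ||a * g_comp + (1 - a) * g_pred||^2 over a in [0, 1]. The resulting
    direction is a descent direction for both objectives when one exists."""
    diff = g_comp - g_pred
    denom = float(diff @ diff)
    if denom == 0.0:
        return 0.5                       # gradients coincide; any weight works
    alpha = float((g_pred - g_comp) @ g_pred) / denom
    return min(max(alpha, 0.0), 1.0)     # clip to the simplex

g_c = np.array([1.0, 0.0])               # toy compression gradient
g_p = np.array([0.0, 2.0])               # toy prediction gradient
a = min_norm_weight(g_c, g_p)
step = a * g_c + (1 - a) * g_p           # combined descent direction
print(a, step)                           # -> 0.8 [0.8 0.4]
```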

516Mostly Exploration-free Algorithms for Multi-Objective Linear Bandits

[openreview] [pdf]

Abstract We address the challenge of solving multi-objective bandit problems, which are increasingly relevant in real-world applications where multiple possibly conflicting objectives must be optimized simultaneously. Existing multi-objective algorithms often rely on complex, computationally intensive methods, making them impractical for real-world use. In this paper, we propose a novel perspective by showing that objective diversity can naturally induce free exploration, allowing for simpler, near-greedy algorithms to achieve state-of-the-art regret bounds. We introduce simple and efficient algorithms for multi-objective linear bandits, which do not require constructing empirical Pareto fronts and achieve a regret bound of $\tilde{\mathcal{O}}(\sqrt{dT})$ under sufficient objective diversity and suitable regularity. We also introduce the concept of objective fairness, ensuring equal treatment of all objectives, and show that our algorithms satisfy this criterion. Numerical experiments validate our theoretical findings, demonstrating that objective diversity can enhance algorithm performance while simplifying the solution process.
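
To make the "objective diversity induces free exploration" idea concrete, here is a toy near-greedy loop: maintain one ridge estimate per objective and act greedily for a randomly drawn objective each round, so the objectives themselves supply the exploration greediness alone would lack. The loop structure and all constants are my illustration, not the paper's algorithm.

```python
import numpy as np

def near_greedy_mo_linear_bandit(arms, theta_true, T, lam=1.0, noise=0.1, seed=0):
    """Toy near-greedy multi-objective linear bandit: shared design matrix,
    one ridge estimate per objective, greedy action for a rotating objective."""
    rng = np.random.default_rng(seed)
    d, m = arms.shape[1], theta_true.shape[0]       # dimension, num objectives
    A = lam * np.eye(d)                             # shared regularized Gram matrix
    b = np.zeros((m, d))                            # per-objective responses
    for _ in range(T):
        theta_hat = np.linalg.solve(A, b.T).T       # (m, d) ridge estimates
        j = rng.integers(m)                         # objective fairness: random pick
        x = arms[np.argmax(arms @ theta_hat[j])]    # greedy for objective j
        y = theta_true @ x + noise * rng.standard_normal(m)
        A += np.outer(x, x)
        b += y[:, None] * x[None, :]
    return theta_hat

arms = np.random.default_rng(1).standard_normal((20, 5))
theta = np.random.default_rng(2).standard_normal((3, 5))
print(near_greedy_mo_linear_bandit(arms, theta, T=500).round(2))
```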

517Learning Generalizable and Well-Shaped Reward Functions from Too Few Demonstrations

[openreview] [pdf]

Abstract Inverse reinforcement learning (IRL) is an important problem that aims to learn a reward function and policy directly from demonstrations, which can often be easier to provide than a well-shaped reward function. However, many real-world tasks include natural variations (i.e., a cleaning robot in a house with different furniture configurations), making it costly to provide demonstrations of every possible scenario. We tackle the problem of few-shot IRL with multi-task data, where the goal is for an agent to learn from a few demonstrations that are not sufficient to fully specify the task, by utilizing an offline multi-task demonstration dataset. Prior work relies on meta-learning or imitation learning, which additionally requires reward labels or a multi-task training environment, or cannot improve with online interactions. We propose Multitask Discriminator Proximity-guided IRL (MPIRL), an IRL method that learns a generalizable and well-shaped reward function by learning a multi-task generative adversarial discriminator with an auxiliary proximity-to-expert reward. We demonstrate the effectiveness of our method on multiple navigation and manipulation tasks.

518Partially Observed Trajectory Inference using Optimal Transport and a Dynamics Prior

[openreview] [pdf]

Abstract Trajectory inference seeks to recover the temporal dynamics of a population from snapshots of its (uncoupled) temporal marginals, i.e. where observed particles are not tracked over time. Lavenant et al. (2023) addressed this challenging problem under a stochastic differential equation (SDE) model with a gradient-driven drift in the observed space, introducing a minimum entropy estimator relative to the Wiener measure. Chizat et al. (2022) then provided a practical grid-free mean-field Langevin (MFL) algorithm using Schrödinger bridges. Motivated by the overwhelming success of observable state space models in the traditional paired trajectory inference problem (e.g. target tracking), we extend the above framework to a class of latent SDEs in the form of observable state space models. In this setting, we use partial observations to infer trajectories in the latent space under a specified dynamics model (e.g. the constant velocity/acceleration models from target tracking). We introduce PO-MFL to solve this latent trajectory inference problem and provide theoretical guarantees by extending the results of Lavenant et al. (2023) to the partially observed setting. We leverage the MFL framework of Chizat et al. (2022), yielding an algorithm based on entropic OT between dynamics-adjusted adjacent time marginals. Experiments validate the robustness of our method and the exponential convergence of the MFL dynamics, and demonstrate significant outperformance over the latent-free method of Chizat et al. (2022) in key scenarios.

[openreview] [pdf]

Abstract Maximum Inner Product Search (MIPS) is essential for machine learning and information retrieval, particularly in applications that operate on high-dimensional data, such as recommender systems and retrieval-augmented generation (RAG), using inner product or cosine similarity. While numerous techniques have been developed for efficient MIPS, their performance often suffers due to a limited understanding of the geometric properties of Inner Product (IP) space. Many approaches reduce MIPS to Nearest Neighbor Search (NNS) through nonlinear transformations, which rely on strong assumptions and can hinder performance. To address these limitations, we propose a novel approach that directly leverages the geometry of IP space. We focus on a class of special vectors called dominators and introduce the Monotonic Relative Dominator Graph (MRDG), an IP-space-native, sparse, and strongly-connected graph designed for efficient MIPS, with solid theoretical foundations. To ensure scalability, we further introduce the Approximate Relative Dominator Graph (ARDG), which retains MRDG’s benefits while significantly reducing indexing complexity. Extensive experiments on 8 public datasets demonstrate that ARDG achieves a 30% average speedup in search at high precision and reduces index size by 2x compared to state-of-the-art graph-based methods.

520IO-LVM: Inverse optimization latent variable models with applications to inferring and explaining paths

[openreview] [pdf]

Abstract Learning representations from solutions of constrained optimization problems (COPs) with unknown cost functions is challenging, as models like (Variational) Autoencoders struggle to capture constraints to decode structured outputs. We propose an inverse optimization latent variable model (IO-LVM) that constructs a latent space of COP costs based on observed decisions, enabling the inference of feasible and meaningful solutions by reconstructing them with a COP solver. To achieve this, we leverage estimated gradients of a Fenchel-Young loss through a non-differentiable deterministic solver while shaping the embedding space. In contrast to established Inverse Optimization or Inverse Reinforcement Learning methods, which typically identify a single or context-conditioned cost function, we exploit the learned representation to capture underlying COP cost structures and identify solutions likely originating from different agents, each using distinct or slightly different cost functions when making decisions. Using both synthetic and actual ship routing data, we validate our approach through experiments on path planning problems using the Dijkstra algorithm, demonstrating the interpretability of the latent space and its effectiveness in path inference and path distribution reconstruction.

521Evolving Multi-Scale Normalization for Time Series Forecasting under Distribution Shifts

[openreview] [pdf]

Abstract Complex distribution shifts are the main obstacle to achieving accurate long-term time series forecasting. Several efforts have been conducted to capture the distribution characteristics and propose adaptive normalization techniques to alleviate the influence of distribution shifts. However, these methods neglect intricate distribution dynamics that are observed from various scales and the evolving functions of both distribution dynamics and normalized mapping relationships. To this end, we propose a novel model-agnostic Evolving Multi-Scale Normalization (EvoMSN) framework to tackle the distribution shift problem. Flexible normalization and denormalization are proposed based on the multi-scale statistics prediction module and adaptive ensembling. An evolving optimization strategy is designed to update the forecasting model and statistics prediction module collaboratively to track the shifting distributions. We evaluate the effectiveness of EvoMSN in improving the performance of five mainstream forecasting methods on benchmark datasets and also show its superiority compared to existing advanced normalization and online learning approaches.

522Federated Maximum Likelihood Inverse Reinforcement Learning with Convergence Guarantee

[openreview] [pdf]

Abstract Inverse Reinforcement Learning (IRL) aims to recover the latent reward function and corresponding optimal policy from observed demonstrations. Existing IRL research predominantly focuses on a centralized learning approach, not suitable for real-world problems with distributed data and privacy restrictions. To this end, this paper proposes a novel algorithm for federated maximum-likelihood IRL (F-ML-IRL) and provides a rigorous analysis of its convergence and time-complexity. The proposed F-ML-IRL leverages a dual-aggregation to update the shared global model and performs bi-level local updates -- an upper-level learning task to optimize the parameterized reward function by maximizing the discounted likelihood of observing expert trajectories under the current policy and a lower-level learning task to find the optimal policy concerning the entropy-regularized discounted cumulative reward under the current reward function. We analyze the convergence and time-complexity of the proposed F-ML-IRL algorithm and show that the global model in F-ML-IRL converges to a stationary point for both the reward and policy parameters within finite time, i.e., the log-distance between the recovered policy and the optimal policy, as well as the gradient of the likelihood objective, converge to zero. Finally, evaluating our F-ML-IRL algorithm on high-dimensional robotic control tasks in MuJoCo, we show that it ensures convergence of the recovered reward in decentralized learning and even outperforms centralized baselines due to its ability to utilize distributed data.

523Is Large-scale Pretraining the Secret to Good Domain Generalization?

[openreview] [pdf]

Abstract Multi-Source Domain Generalization (DG) is the task of training on multiple source domains and achieving high classification performance on unseen target domains. Recent methods combine robust features from web-scale pretrained backbones with new features learned from source data, and this has dramatically improved benchmark results. However, it remains unclear if DG finetuning methods are becoming better over time, or if improved benchmark performance is simply an artifact of stronger pre-training. Prior studies have shown that perceptual similarity to pre-training data correlates with zero-shot performance, but we find the effect limited in the DG setting. Instead, we posit that having perceptually similar data in pretraining is not enough; and that it is how well these data were learned that determines performance. This leads us to introduce the Alignment Hypothesis, which states that the final DG performance will be high if and only if alignment of image and class label text embeddings is high. Our experiments confirm the Alignment Hypothesis is true, and we use it as an analysis tool of existing DG methods evaluated on DomainBed datasets by splitting evaluation data into In-pretraining (IP) and Out-of-pretraining (OOP). We show that all evaluated DG methods struggle on DomainBed-OOP, while recent methods excel on DomainBed-IP. Put together, our findings highlight the need for DG methods which can generalize beyond pretraining alignment.

524FunBO: Discovering Acquisition Functions for Bayesian Optimization with FunSearch

[openreview] [pdf]

Abstract The sample efficiency of Bayesian optimization algorithms depends on carefully crafted acquisition functions (AFs) guiding the sequential collection of function evaluations. The best-performing AF can vary significantly across optimization problems, often requiring ad-hoc and problem-specific choices. This work tackles the challenge of designing novel AFs that perform well across a variety of experimental settings. Based on FunSearch, a recent work using Large Language Models (LLMs) for discovery in mathematical sciences, we propose FunBO, an LLM-based method that can be used to learn new AFs written in computer code by leveraging access to a limited number of evaluations for a set of objective functions. We provide the analytic expression of all discovered AFs and evaluate them on various global optimization benchmarks and hyperparameter optimization tasks. We show how FunBO identifies AFs that generalize well in and out of the training distribution of functions, thus outperforming established general-purpose AFs and achieving competitive performance against AFs that are customized to specific function types and are learned via transfer-learning algorithms.

525Optimizing Knowledge Distillation in Transformers: Enabling Power of Multi-Head Attention without Alignment Barriers

[openreview] [pdf]

Abstract Knowledge distillation has been proven effective for compressing transformer architectures by transferring knowledge from teacher to student models. Logits-based methods of knowledge distillation cannot fully capture the intermediate representations and features within the teacher model, which may result in the student model not fully learning all the knowledge from the teacher model. Thus, previous work focuses on transferring knowledge through intermediate features or attention maps. However, leveraging multi-head attention maps in transformers for knowledge distillation presents challenges due to head misalignment and suboptimal feature alignment, often requiring projectors to align features or special modifications to the model architecture. To address the above limitations, we propose the Squeezing-Heads Distillation (SHD) method. This method reduces the number of attention maps to any desired number through linear approximation, without requiring additional projectors or parameters. This facilitates better alignment and knowledge transfer between models with different numbers of heads, enhancing both flexibility and efficiency. Experimental results demonstrate significant improvements in both language and vision generative models, validating the effectiveness of our method.
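
The head-squeezing idea can be sketched in a few lines: combine H teacher attention maps into H' maps by a linear operation so they align with the student's heads. The uniform grouping below is a placeholder for the paper's linear approximation, which the abstract does not fully specify; treat it as illustrative only.

```python
import torch

def squeeze_heads(teacher_attn, num_student_heads):
    """Reduce H teacher attention maps to H' by linear combination (sketch).
    teacher_attn: (batch, H, seq, seq) attention probabilities.
    Here the mixing weights are a fixed uniform grouping; SHD's actual
    linear approximation is more general."""
    B, H, S, _ = teacher_attn.shape
    Hp = num_student_heads
    assert H % Hp == 0, "illustrative version assumes divisible head counts"
    group = H // Hp
    # average each group of teacher heads into one squeezed map
    return teacher_attn.view(B, Hp, group, S, S).mean(dim=2)

attn = torch.softmax(torch.randn(2, 12, 16, 16), dim=-1)
print(squeeze_heads(attn, 4).shape)   # torch.Size([2, 4, 16, 16])
```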

526Machine Unlearning For Alleviating Negative Transfer In Partial-Set Source-Free Unsupervised Domain Adaptation

[openreview] [pdf]

Abstract Source-free Unsupervised Domain Adaptation (SFUDA) aims to adjust a source model trained on a labeled source domain to a related but unlabeled target domain without accessing the source data. Many SFUDA methods are studied in closed-set scenarios where the target domain and source domain categories are perfectly aligned. However, a more practical scenario is a partial-set scenario where the source label space subsumes the target one. In this paper, we prove that reducing the differences between the source and target domains in the partial-set scenario helps to achieve domain adaptation. And we propose a simple yet effective SFUDA framework called the Machine Unlearning Framework to alleviate the negative transfer problem in the partial-set scenario, thereby allowing the model to focus on the target domain categories. Specifically, we first generate noise samples for each category that only exists in the source domain and generate pseudo-labeled samples from the target domain. Then, in the forgetting stage, we use these samples to train the model, making it behave as if it had never seen the classes that exist only in the source domain. Finally, in the adaptation stage, we use only the pseudo-labeled samples to conduct self-supervised training on the model, making it more adaptable to the target domain. Our method is easy to implement and pluggable, suitable for various pre-trained models. Experimental results show that our method can well alleviate the negative transfer problem and improve model performance under various target domain category settings.

527Propensity-driven Uncertainty Learning for Sample Exploration in Source-Free Active Domain Adaptation

[openreview] [pdf]

Abstract Source-free active domain adaptation (SFADA) addresses the challenge of adapting a pre-trained model to new domains without access to source data while minimizing the need for target domain annotations. This scenario is particularly relevant in real-world applications where data privacy, storage limitations, or labeling costs are significant concerns. Key challenges in SFADA include selecting the most informative samples from the target domain for labeling, effectively leveraging both labeled and unlabeled target data, and adapting the model without relying on source domain information. Additionally, existing methods often struggle with noisy or outlier samples and may require impractical progressive labeling during training. To effectively select more informative samples without frequently requesting human annotations, we propose the Propensity-driven Uncertainty Learning (ProULearn) framework. ProULearn utilizes a novel homogeneity propensity estimation mechanism combined with correlation index calculation to evaluate feature-level relationships. This approach enables the identification of representative and challenging samples while avoiding noisy outliers. Additionally, we develop a central correlation loss to refine pseudo-labels and create compact class distributions during adaptation. In this way, ProULearn effectively bridges the domain gap and maximizes adaptation performance. The principles of informative sample selection underlying ProULearn have broad implications beyond SFADA, offering benefits across various deep learning tasks where identifying key data points or features is crucial. Extensive experiments on four benchmark datasets demonstrate that ProULearn consistently outperforms state-of-the-art methods in domain adaptation scenarios.

528Influential Language Data Selection via Gradient Trajectory Pursuit

[openreview] [pdf]

Abstract Curating a desirable dataset for training has been the core of building highly capable large language models (Touvron et al., 2023; Achiam et al., 2023; Team et al., 2024). Gradient influence scores (Pruthi et al., 2020; Xia et al., 2024) have been shown to be correlated with model performance and are commonly used as the criterion for data selection. However, existing methods are built upon either individual sample rankings or an inefficient matching process, leading to suboptimal performance or scaling issues. In this paper, we propose Gradient Trajectory Pursuit (GTP), an algorithm that performs pursuit of gradient trajectories via jointly selecting data points under an L0-norm regularized objective. The proposed algorithm highlights: (1) joint selection instead of independent top-k selection, which automatically de-duplicates samples; (2) higher efficiency with compressive sampling processes, which can be further sped up using a distributed framework. In the experiments, we demonstrate the algorithm in both in-domain and target-domain selection benchmarks and show that it outperforms top-k selection and competitive algorithms consistently; for example, our algorithm selects as little as 0.5% of the data to achieve full performance on the targeted instruction tuning tasks.
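
Joint selection under an L0-norm budget is closely related to matching pursuit, which makes the de-duplication property easy to see: once a sample is selected, only the residual it fails to explain drives later picks. The sketch below runs a plain orthogonal-matching-pursuit loop over per-sample gradient features; it is a surrogate for GTP's actual objective and sampling machinery, with all names invented.

```python
import numpy as np

def gradient_trajectory_pursuit(G, target, k):
    """Greedily select k rows of G (one per candidate sample) whose span
    best reconstructs a target gradient direction (OMP-style sketch)."""
    selected, residual = [], target.copy()
    for _ in range(k):
        scores = G @ residual                  # correlation with the residual
        scores[selected] = -np.inf             # joint selection de-duplicates
        selected.append(int(np.argmax(scores)))
        A = G[selected].T                      # refit on the chosen set
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        residual = target - A @ coef           # what remains unexplained
    return selected

rng = np.random.default_rng(0)
G = rng.standard_normal((100, 32))             # per-sample gradient features
target = G[3] + 0.5 * G[40]                    # planted combination
print(gradient_trajectory_pursuit(G, target, k=2))  # likely [3, 40]
```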

529Counterfactual Learning under Rank Preservation

[openreview] [pdf]

Abstract Counterfactual inference aims to estimate the counterfactual outcome given knowledge of an observed treatment and the factual outcome, with broad applications in fields such as epidemiology, econometrics, and management science. In this paper, we propose a principled approach for identifying and estimating the counterfactual outcome. Specifically, we introduce a simple and intuitive rank preservation assumption to identify the counterfactual outcome without relying on a known structural causal model. Building on this, we propose a novel ideal loss for theoretically unbiased learning of the counterfactual outcome and further develop a kernel-based estimator for its empirical estimation. Our theoretical analysis shows that the proposed ideal loss is convex, and the proposed estimator is unbiased. Extensive semi-synthetic and real-world experiments are conducted to demonstrate the effectiveness of the proposed method.

530FAST: Federated Average with Snapshot Unleashes Arbitrary Client Participation

[openreview] [pdf]

Abstract Federated Learning (FL) provides a flexible distributed platform where numerous clients with high degrees of heterogeneity in data and system can collaborate to learn a model jointly. Previous research has shown that Federated Learning is effective in handling diverse data, but often assumes idealized conditions. Specifically, client participation is often simplified in these studies, while real-world factors make it difficult to predict or design individual client participation. This complexity often diverges from the ideal client participation assumption, rendering an unknown pattern of client participation, referred to as arbitrary client participation. Hence, it is an important open problem to explore the impact of client participation and find a lightweight mechanism to enable arbitrary client participation in FL. In this paper, we first empirically investigate the influence of client participation on FL, revealing that FL algorithms are significantly impacted by arbitrary client participation. Afterward, to alleviate this influence, we propose a lightweight solution, Federated Average with Snapshot (FAST), to unleash almost arbitrary client participation for FL. It can seamlessly integrate with other classic FL algorithms. Specifically, FAST requires the clients to take a snapshot once in a while and facilitates arbitrary client participation for the majority of the training process. We show the convergence rates of FAST in non-convex and strongly-convex cases, which match the rates under ideal client participation. Furthermore, we empirically introduce an adaptive strategy for dynamically configuring the snapshot frequency, tailored to accommodate diverse FL systems. Our extensive numerical results demonstrate that our FAST algorithm attains significant improvements under the conditions of arbitrary client participation and highly heterogeneous data.
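The snapshot mechanism is simple enough to sketch. Below is a hypothetical simulation loop, assuming each client object exposes a `local_train(model)` method returning an updated model copy: most rounds accept whatever clients show up, and every `snapshot_every` rounds all clients contribute a snapshot update.

```python
import random
import torch

def fedavg_state(models):
    """Average the state dicts of the participating client models."""
    keys = models[0].state_dict().keys()
    return {k: torch.stack([m.state_dict()[k].float() for m in models]).mean(0)
            for k in keys}

def fast_training(server, clients, rounds, snapshot_every=10):
    """FAST-style sketch: arbitrary participation most rounds, periodic
    all-client snapshot rounds as the hypothetical anchoring mechanism."""
    for t in range(rounds):
        if t % snapshot_every == 0:
            active = clients                                  # snapshot round
        else:
            active = random.sample(clients, random.randint(1, len(clients)))
        updates = [c.local_train(server) for c in active]     # hypothetical client API
        server.load_state_dict(fedavg_state(updates))
    return server
```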

531VideoPanda: Video Panoramic Diffusion With Multi-view Attention

[openreview] [pdf]

Abstract High resolution panoramic video content is paramount for immersive experiences in Virtual Reality, but is non-trivial to collect as it requires specialized equipment and intricate camera setups. In this work, we introduce VideoPanda, a novel approach for synthesizing 360° videos conditioned on text or single-view video data. VideoPanda leverages multi-view attention layers to augment a video diffusion model, enabling it to generate consistent multi-view videos that can be combined into immersive panoramic content. VideoPanda is trained jointly using two conditions: text-only and single-view video, and supports autoregressive generation of long videos. To overcome the computational burden of multi-view video generation, we randomly subsample the duration and camera views used during training and show that the model is able to gracefully generalize to generating more frames during inference. Extensive evaluations on both real-world and synthetic video datasets demonstrate that VideoPanda generates more realistic and coherent 360° panoramas across all input conditions compared to existing methods. Visit the project website at https://mvpanovideo.github.io/VideoPanda/ for results.

532Many-Objective Multi-Solution Transport

[openreview] [pdf]

Abstract Optimizing the performance of many objectives (instantiated by tasks or clients) jointly with a few Pareto stationary solutions (models) is critical in machine learning. However, previous multi-objective optimization methods often focus on a few objectives and cannot scale to many objectives that outnumber the solutions, leading to either subpar performance or ignored objectives. We introduce "Many-objective multi-solution Transport (MosT)", a framework that finds multiple diverse solutions in the Pareto front of many objectives. Our insight is to seek multiple solutions, each performing as a domain expert and focusing on a specific subset of objectives while collectively covering all of them. MosT formulates the problem as a bi-level optimization of weighted objectives for each solution, where the weights are defined by an optimal transport between objectives and solutions. Our algorithm ensures convergence to Pareto stationary solutions for complementary subsets of objectives. On a range of applications in federated learning, multi-task learning, and mixture-of-prompt learning for LLMs, MosT distinctly outperforms strong baselines, delivering high-quality, diverse solutions that profile the entire Pareto frontier, thus ensuring balanced trade-offs across many objectives.

533Taming Continuous Spurious Shift in Domain Adaptation

[openreview] [pdf]

Abstract Recent advances in domain adaptation have shown promise in transferring knowledge across domains characterized by a continuous value or vector, such as varying patient ages, where "age" serves as a continuous index. However, these approaches often fail when spurious features shift continuously along with the domain index. This paper introduces the first method designed to withstand the continuous shifting of spurious features during domain adaptation. Our method enhances domain adaptation performance by aligning causally transportable encodings across continuously indexed domains. Theoretical analysis demonstrates that our approach more effectively ensures causal transportability across different domains. Empirical results, from both semi-synthetic and real-world medical datasets, indicate that our method outperforms state-of-the-art domain adaptation methods.

534Understanding Constraint Inference in Safety-Critical Inverse Reinforcement Learning

[openreview] [pdf]

Abstract In practical applications, the underlying constraint knowledge is often unknown and difficult to specify. To address this issue, recent advances in Inverse Constrained Reinforcement Learning (ICRL) have focused on inferring these constraints from expert demonstrations. However, the ICRL approach typically characterizes constraint learning as a tri-level optimization problem, which is inherently complex due to its interdependent variables and multiple layers of optimization. Considering these challenges, a critical question arises: Can we implicitly embed constraint signals into reward functions and effectively solve this problem using a classic reward inference algorithm? The resulting method, known as Inverse Reward Correction (IRC), merits investigation. In this work, we conduct a theoretical analysis comparing the sample complexities of both solvers. Our findings confirm that the IRC solver achieves lower sample complexity than its ICRL counterpart. Nevertheless, this reduction in complexity comes at the expense of generalizability. Specifically, in the target environment, the reward correction terms may fail to guarantee the safety of the resulting policy, whereas this issue can be effectively mitigated by transferring the constraints via the ICRL solver. Advancing our inquiry, we investigate conditions under which the ICRL solver ensures ε-optimality when transferring to new environments. Empirical results across various environments validate our theoretical findings, underscoring the nuanced trade-offs between complexity reduction and generalizability in safety-critical applications.

535Boosting LLM Translation Skills without General Ability Loss via Rationale Distillation

[openreview] [pdf]

Abstract Large Language Models (LLMs) have achieved impressive results across numerous NLP tasks but still encounter difficulties in machine translation. Traditional methods to improve translation have typically involved fine-tuning LLMs using parallel corpora. However, vanilla fine-tuning often leads to catastrophic forgetting of the instruction-following capabilities and alignment with human preferences, compromising their broad general abilities and introducing potential security risks. These abilities, which are developed using proprietary and unavailable training data, make existing continual instruction tuning methods ineffective. To overcome this issue, we propose a novel approach called RaDis (Rationale Distillation). RaDis harnesses the strong generative capabilities of LLMs to create rationales for training data, which are then "replayed" to prevent forgetting. These rationales encapsulate general knowledge and safety principles and act as self-distillation targets to regulate the training process. By jointly training on both reference translations and self-generated rationales, the model can learn new translation skills while preserving its overall general abilities. Extensive experiments demonstrate that our method enhances machine translation performance while maintaining the broader capabilities of LLMs across other tasks. This work presents a pathway for creating more versatile LLMs that excel in specialized tasks without compromising generality and safety.
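The replay idea can be sketched in a few lines. The helper below is hypothetical (the prompt wording, the `generate_text` call, and the output format are all assumptions): the pre-finetuning model writes a rationale for each parallel pair, and the training target becomes the reference translation concatenated with that self-generated rationale, so the ordinary language-modeling loss doubles as self-distillation.

```python
def build_radis_example(base_model, source, reference):
    """Hypothetical RaDis-style data builder: attach a self-generated
    rationale to the reference translation as the training target."""
    prompt = (f"Translate into English: {source}\n"
              f"Translation: {reference}\n"
              f"Briefly explain why this translation is appropriate:")
    rationale = base_model.generate_text(prompt)    # hypothetical generation API
    return {
        "input": f"Translate into English: {source}",
        "output": f"{reference}\n\nRationale: {rationale}",  # replayed jointly
    }
```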

536Early Period of Training Impacts Adaptation for Out-of-Distribution Generalization: An Empirical Study

[openreview] [pdf]

Abstract Prior research shows that differences in the early period of neural network training significantly impact performance on in-distribution (ID) data. Yet, the implications of early learning dynamics for out-of-distribution (OOD) generalization remain poorly understood, primarily due to the complexities and limitations of existing analytical techniques. In this work, we investigate the relationship between learning dynamics, OOD generalization under covariate shift, and the early period of neural network training. We utilize the trace of Fisher Information and sharpness, focusing on gradual unfreezing (i.e., progressively unfreezing parameters during training) as our methodology for investigation. Through a series of empirical experiments, we show that 1) changing the number of trainable parameters during the early period of training via gradual unfreezing can significantly improve OOD results; 2) the trace of Fisher Information and sharpness can be used as indicators for the removal of interventions during the early period of training for better OOD generalization. Our experiments on both image and text data show that the early period of training is a general phenomenon that can provide Pareto improvements in ID and OOD performance with minimal complexity. Our work represents a first step towards understanding how early learning dynamics affect neural network OOD generalization and suggests a new avenue to improve and study this problem.
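Both ingredients, gradual unfreezing and the Fisher-trace indicator, are easy to prototype. The sketch below is a plausible PyTorch rendition under simple assumptions: layers unfreeze back-to-front at a fixed interval, and the Fisher trace is approximated by the squared gradient norm of the loss.

```python
import torch

def apply_gradual_unfreezing(layers, step, unfreeze_every=100):
    """Keep only the last n_open layers trainable, opening one more
    (back to front) every `unfreeze_every` steps."""
    n_open = min(len(layers), 1 + step // unfreeze_every)
    for i, layer in enumerate(layers):
        trainable = i >= len(layers) - n_open
        for p in layer.parameters():
            p.requires_grad_(trainable)

def fisher_trace(model, loss):
    """Crude empirical Fisher trace: sum of squared gradients of the loss,
    usable as an indicator for when to stop the unfreezing intervention."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, retain_graph=True)
    return sum(g.pow(2).sum() for g in grads)
```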

537Prompt Diffusion Robustifies Any-Modality Prompt Learning

[openreview] [pdf]

Abstract Foundation models enable prompt-based classifiers for zero-shot and few-shot learning. Nonetheless, the conventional method of employing fixed prompts suffers from distributional shifts that negatively impact generalizability to unseen samples. This paper introduces prompt diffusion, which uses a diffusion model to gradually refine prompts to obtain a customized prompt for each sample. Specifically, we first optimize a collection of prompts to obtain over-fitted prompts per sample. Then, we propose a prompt diffusion model within the prompt space, enabling the training of a generative transition process from a random prompt to its overfitted prompt. As we cannot access the label of a test image during inference, our model gradually generates customized prompts solely from random prompts using our trained prompt diffusion. Our prompt diffusion is generic, flexible, and modality-agnostic, making it a simple plug-and-play module seamlessly embedded into existing prompt learning methods for textual, visual, or multi-modal prompt learning. Our diffusion model uses a fast ODE-based sampling strategy to optimize test sample prompts in just five steps, offering a good trade-off between performance improvement and computational efficiency. For all prompt learning methods tested, adding prompt diffusion yields more robust results for base-to-new generalization, cross-dataset generalization, and domain generalization in classification tasks tested over 15 diverse datasets.

538Is the Fairness Metric Truly Fair?

[openreview] [pdf]

Abstract Image classification is a fundamental task in computer vision that has been widely adopted in critical applications such as face recognition and medical imaging, drawing considerable attention to its predictive fairness. Some researchers have proposed various fairness metrics and pipelines to enhance the fairness of deep learning models. However, recent studies indicate that existing fairness evaluation specifications and metrics have inherent flaws, as they focus on low-dimensional inputs, such as numerical data, and overlook partial correlations between target and sensitive attributes, leading to some degree of mutual exclusivity. This raises the question: Is the fairness metric truly fair? Through in-depth analysis and experiments, we conclude that the fairness of deep models is closely related to attribute sampling and the interdependencies among attributes. In this work, we address this challenge by introducing a new specification based on dynamic perturbation for image classification models. Specifically, we introduce an Attribute Projection Perturbation Strategy (APPS) that moves beyond the constraints of directly statistical discrete predictions by mapping sensitive attributes that may influence task attributes onto the same dimension for evaluation. Building on this, a Projection Fairness Metric System is proposed to quantify the upper and lower bounds of fairness perturbations, examining and evaluating the impact of mapped sensitive attributes on the fairness of task predictions from different perspectives. Additionally, we conducted systematic evaluation experiments and extensive discussions, demonstrating that the proposed evaluation specification offers better objectivity and interpretability compared to existing metrics, across 24 image classification models including CNN and ViT architectures. It is hoped that this work will promote the standardization of fairness evaluation pipelines and metrics.

539Do Influence Functions Work on Large Language Models?

[openreview] [pdf]

Abstract Influence functions aim to quantify the impact of individual training data points on a model’s predictions. While extensive research has been conducted on influence functions in traditional machine learning models, their application to large language models (LLMs) has been limited. In this work, we conduct a systematic study to address a key question: do influence functions work on LLMs? Specifically, we evaluate influence functions across multiple tasks and find that they consistently perform poorly in most settings. Our further investigation reveals that their poor performance can be attributed to: (1) inevitable approximation errors when estimating the iHVP component due to the scale of LLMs, (2) uncertain convergence during fine-tuning, and, more fundamentally, (3) the definition itself, as changes in model parameters do not necessarily correlate with changes in LLM behavior. Our study thus suggests the need for alternative approaches for identifying influential samples. To support future work, our code is made available at https://github.com/anonymous.

540Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

[openreview] [pdf]

Abstract Direct Preference Optimization (DPO), and its numerous variants, are increasingly used for aligning language models. Although they are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to semantically opposite ones. As a simple example, training a model to prefer No over Never can sharply increase the probability of Yes. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.
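The abstract gives enough to sketch a CHES-style diagnostic, though the paper's exact definition may differ; the version below is an assumption: summed last-layer token embeddings of the two responses, with the preferred response's squared norm as the centering term. Pairs scoring high would be the candidates to filter out.

```python
import torch

def ches_score(h_pref, h_dispref):
    """Hedged CHES-style similarity. h_pref: (T+, d) and h_dispref: (T-, d)
    hidden states of the preferred / dispreferred response tokens. Larger
    values suggest embeddings similar enough to drive likelihood displacement."""
    e_pos = h_pref.sum(dim=0)       # summed embeddings, preferred response
    e_neg = h_dispref.sum(dim=0)    # summed embeddings, dispreferred response
    return e_neg @ e_pos - e_pos @ e_pos  # centered by the preferred norm (assumption)
```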

541Towards counterfactual fairness through auxiliary variables

[openreview] [pdf]

Abstract The challenge of balancing fairness and predictive accuracy in machine learning models, especially when sensitive attributes such as race, gender, or age are considered, has motivated substantial research in recent years. Counterfactual fairness ensures that predictions remain consistent across counterfactual variations of sensitive attributes, which is a crucial concept in addressing societal biases. However, existing counterfactual fairness approaches usually overlook intrinsic information about sensitive features, limiting their ability to achieve fairness while simultaneously maintaining performance. To tackle this challenge, we introduce EXOgenous Causal reasoning (EXOC), a novel causal reasoning framework motivated by exogenous variables. It leverages auxiliary variables to uncover intrinsic properties that give rise to sensitive attributes. Our framework explicitly defines an auxiliary node and a control node that contribute to counterfactual fairness and control the information flow within the model. Our evaluation, conducted on synthetic and real-world datasets, validates EXOC’s superiority, showing that it outperforms state-of-the-art approaches in achieving counterfactual fairness without sacrificing accuracy.

542Broaden your SCOPE! Efficient Conversation Planning for LLMs with Semantic Space

[openreview] [pdf]

Abstract Large language models (LLMs) are used in chatbots or AI assistants to hold conversations with a human user. In such applications, the quality (e.g., user engagement, safety) of a conversation is important and can only be exactly known at the end of the conversation. To maximize its expected quality, conversation planning reasons about the stochastic transitions within a conversation to select the optimal LLM response at each turn. Existing simulation-based conversation planning algorithms typically select the optimal response by simulating future conversations with a large number of LLM queries at every turn. However, this process is extremely time-consuming and hence impractical for real-time conversations. This paper presents a novel approach called Semantic space COnversation Planning with improved Efficiency (SCOPE) that exploits the dense semantic representation of conversations to perform conversation planning efficiently. In particular, SCOPE models the stochastic transitions in conversation semantics and their associated rewards to plan entirely within the semantic space. This gives the advantage of allowing the optimal LLM response to be selected at every conversation turn without needing additional LLM queries for simulation. As a result, SCOPE can perform conversation planning 70 times faster than conventional simulation-based planning algorithms when applied to a wide variety of conversation starters and two reward functions seen in the real world, while achieving a higher reward within a practical planning budget.

543Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives

[openreview] [pdf]

Abstract For aligning large language models (LLMs), prior work has leveraged reinforcement learning via human feedback (RLHF) or variations of direct preference optimization (DPO). While DPO offers a simpler framework based on maximum likelihood estimation, it compromises on the ability to tune language models to easily maximize non-differentiable objectives according to the LLM designer’s preferences (e.g., using simpler language or minimizing specific kinds of harmful content). These may neither align with user preferences nor be tractably captured by binary preference data. To leverage the simplicity and performance of DPO with the generalizability of RL, we propose a hybrid approach between DPO and RLHF. With a simple augmentation to the implicit reward decomposition of DPO, we allow for tuning LLMs to maximize a set of arbitrary auxiliary rewards using offline RL. The proposed method, Hybrid Preference Optimization (HPO), shows the ability to effectively generalize to both user preferences and auxiliary designer objectives, while preserving alignment performance across a range of challenging benchmarks and model sizes.

544No-Regret is not enough! Bandits with General Constraints through Adaptive Regret Minimization

[openreview] [pdf]

Abstract In the bandits with knapsacks framework (BwK) the learner has m resource-consumption (i.e., packing) constraints. We focus on the generalization of BwK in which the learner has a set of general long-term constraints. The goal of the learner is to maximize their cumulative reward, while at the same time achieving small cumulative constraint violations. In this scenario, there exist simple instances where conventional methods for BwK fail to yield sublinear violations of constraints. We show that it is possible to circumvent this issue by requiring the primal and dual algorithms to be weakly adaptive. Indeed, even in the absence of any information on the Slater parameter ρ characterizing the problem, the interplay between weakly adaptive primal and dual regret minimizers yields a "self-bounding" property of dual variables. In particular, their norm remains suitably upper bounded across the entire time horizon even without explicit projection steps. By exploiting this property, we provide best-of-both-worlds guarantees for stochastic and adversarial inputs. In the first case, we show that the algorithm guarantees sublinear regret. In the latter case, we establish a tight competitive ratio of ρ/(1+ρ). In both settings, constraint violations are guaranteed to be sublinear in time. Finally, these results allow us to obtain new results for the problem of contextual bandits with linear constraints, providing the first no-α-regret guarantees for adversarial contexts.

545Reconstruct the Understanding of Grokking through Dynamical Systems

[openreview] [pdf]

Abstract Grokking, or the delayed generalization phenomenon, describes the abrupt and rapid improvement in test accuracy that occurs after a model has been overfitted for a prolonged period. This phenomenon was first identified by Power et al. in the context of operations on a prime number field. Over the past two years, a range of mathematical analyses has been conducted to investigate grokking, typically involving the use of a hidden progress measure, i.e., a function that can anticipate the occurrence of grokking. We believe that a comprehensive and rigorous mathematical modeling approach can invigorate research on this task and provide a unified perspective for understanding previous work. This paper introduces a novel approach by modeling the task as a unique dynamical system. Using mathematical derivation within this framework, we propose a robust hidden progress measure that effectively captures the grokking phenomenon across all operations on prime number fields. This approach not only provides a more complete understanding but also offers deeper insights into the underlying architecture of the model. Based on this understanding, we also propose a method to accelerate grokking without involving regularization or altering the model architecture.

546SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation

[openreview] [pdf]

Abstract The development of diffusion models has led to significant progress in image and video generation tasks, with pre-trained models like the Stable Diffusion series playing a crucial role. However, a key challenge remains in downstream task applications: how to effectively and efficiently adapt pre-trained diffusion models to new tasks. Inspired by model pruning, which lightens large pre-trained models by removing unimportant parameters, we propose a novel model fine-tuning method that makes full use of these ineffective parameters and equips the pre-trained model with new task-specific capabilities. In this work, we first investigate the importance of parameters in pre-trained diffusion models and discover that parameters with the smallest absolute values do not contribute to the generation process due to training instabilities. Based on this observation, we propose a fine-tuning method termed SaRA that re-utilizes these temporarily ineffective parameters, which amounts to optimizing a sparse weight matrix to learn task-specific knowledge. To mitigate potential overfitting, we propose a nuclear-norm-based low-rank sparse training scheme for efficient fine-tuning. Furthermore, we design a new progressive parameter adjustment strategy to make full use of the finetuned parameters. Finally, we propose a novel unstructural backpropagation strategy, which significantly reduces memory costs during fine-tuning. Our method enhances the generative capabilities of pre-trained models in downstream applications and outperforms existing fine-tuning methods in maintaining the model’s generalization ability.
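The core selection rule, training only the smallest-magnitude parameters, can be sketched directly; the nuclear-norm low-rank regularizer and the unstructural backpropagation trick are omitted here. A minimal PyTorch sketch, assuming a fixed per-matrix sparsity ratio:

```python
import torch

def sara_masks(model, ratio=0.05):
    """Mark the `ratio` fraction of smallest-|w| entries of each weight
    matrix as trainable; everything else stays frozen via gradient masking."""
    masks = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.dim() < 2:                  # skip biases / norms (assumption)
                continue
            k = max(1, int(p.numel() * ratio))
            threshold = p.abs().flatten().kthvalue(k).values
            masks[name] = (p.abs() <= threshold).float()
    return masks

def mask_gradients(model, masks):
    """Call after loss.backward(): zero gradients outside the sparse mask."""
    for name, p in model.named_parameters():
        if name in masks and p.grad is not None:
            p.grad.mul_(masks[name])
```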

547Enriching Knowledge Distillation with Intra-Class Contrastive Learning

[openreview] [pdf]

Abstract Since the advent of knowledge distillation, much research has focused on how the soft labels generated by the teacher model can be utilized effectively. A study points out that the implicit knowledge within soft labels originates from the multi-view structure present in the data. Feature variations within samples of the same class allow the student model to generalize better by learning diverse representations. However, in existing distillation methods, teacher models predominantly adhere to ground-truth labels as targets, without considering the diverse representations within the same class. Therefore, we propose incorporating an intra-class contrastive loss during teacher training to enrich the intra-class information contained in soft labels. In practice, we find that the intra-class loss causes instability in training and slows convergence. To mitigate these issues, a margin loss is integrated into intra-class contrastive learning to improve training stability and convergence speed. Simultaneously, we theoretically analyze the impact of this loss on intra-class and inter-class distances, and prove that the intra-class contrastive loss can enrich intra-class diversity. Experimental results demonstrate the effectiveness of the proposed method.
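One plausible form of the margin-stabilized intra-class contrastive term (an assumption, not the paper's exact loss) penalizes same-class feature pairs only when their cosine similarity exceeds a margin, so representations diversify without being pushed apart indefinitely:

```python
import torch
import torch.nn.functional as F

def intra_class_contrastive_loss(features, labels, margin=0.5):
    """Penalize same-class pairs whose cosine similarity exceeds `margin`.
    features: (n, d) teacher embeddings; labels: (n,) class indices."""
    z = F.normalize(features, dim=1)
    sim = z @ z.t()                                  # pairwise cosine similarity
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1))
    same.fill_diagonal_(False)                       # ignore self-pairs
    if not same.any():
        return z.new_zeros(())
    return F.relu(sim[same] - margin).mean()         # margin bounds the repulsion
```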

548Diffusion-Nested Auto-Regressive Synthesis of Heterogeneous Tabular Data

[openreview] [pdf]

Abstract Autoregressive models are predominant in natural language generation, while their application in tabular data remains underexplored. We posit that this can be attributed to two factors: 1) tabular data contains heterogeneous data types, while the autoregressive model is primarily designed to model discrete-valued data; 2) tabular data is column permutation-invariant, requiring a generation model to generate columns in arbitrary order. This paper proposes a Diffusion-nested Autoregressive model (TabDAR) to address these issues. To enable autoregressive methods for continuous columns, TabDAR employs a diffusion model to parameterize the conditional distribution of continuous features. To ensure arbitrary generation order, TabDAR resorts to masked transformers with bi-directional attention, which simulate various permutations of column order, hence enabling it to learn the conditional distribution of a target column given an arbitrary combination of other columns. These designs enable TabDAR to not only freely handle heterogeneous tabular data but also support convenient and flexible unconditional/conditional sampling. We conduct extensive experiments on ten datasets with distinct properties, and the proposed TabDAR outperforms previous state-of-the-art methods by 18% to 45% on eight metrics across three distinct aspects.

549Boundless Socratic Learning

[openreview] [pdf]

Abstract An agent trained within a closed system can master any desired capability, as long as the following three conditions hold: (a) it receives sufficiently informative and aligned feedback, (b) its coverage of experience/data is broad enough, and (c) it has sufficient capacity and resources. In this white paper, we justify these conditions, and consider what limitations arise from (a) and (b) in closed systems, when assuming that (c) is not a bottleneck. Considering the special case of agents with matching input and output spaces (namely, language), we argue that such pure recursive self-improvement, dubbed "Socratic learning", can boost performance vastly beyond what is present in its initial data or knowledge, and is only limited by time, as well as gradual misalignment concerns. Furthermore, we propose a constructive framework to implement it, based on the notion of language games.

550Turning Challenges into Opportunities: How Distribution Shifts Enhance Identifiability in Causal Representation Learning

[openreview] [pdf]

Abstract Causal representation learning seeks to uncover latent causal variables and their relationships from observed, unstructured data, a task complicated by identifiability challenges. While distribution shifts, viewed as natural interventions on latent causal variables, often present difficulties in traditional machine learning tasks, they also create valuable opportunities for identifiability by introducing variability in latent variables. In this paper, we study a non-parametric condition characterizing the types of distribution shifts that contribute to identifiability within the context of latent additive noise models. We also present partial identifiability results when only a portion of distribution shifts meets the condition. Furthermore, we extend our findings to latent post-nonlinear causal models. Building on our theoretical results, we propose a practical algorithm facilitating the acquisition of reliable latent causal representations. Our algorithm, guided by our underlying theory, has demonstrated outstanding performance across a diverse range of synthetic and real-world datasets. The empirical observations closely align with the theoretical findings, affirming the robustness and effectiveness of our proposed approach.

551GROD: Enhancing Generalization of Transformer with Out-of-Distribution Detection

[openreview] [pdf]

Abstract Transformer networks excel in natural language processing (NLP) and computer vision (CV) tasks. However, they face challenges in generalizing to Out-of-Distribution (OOD) datasets, that is, data whose distribution differs from that seen during training. OOD detection aims to distinguish data that deviates from the expected distribution, while maintaining optimal performance on in-distribution (ID) data. This paper introduces a novel approach based on OOD detection, termed the Generate Rounded OOD Data (GROD) algorithm, which significantly bolsters the generalization performance of transformer networks across various tasks. GROD is motivated by our new OOD detection Probably Approximately Correct (PAC) theory for transformers: the transformer has learnability in terms of OOD detection, that is, when the data is sufficient, the outliers can be well represented. By penalizing the misclassification of OOD data within the loss function and generating synthetic outliers, GROD guarantees learnability and refines the decision boundaries between inliers and outliers. This strategy demonstrates robust adaptability and general applicability across different data types. Evaluated across diverse OOD detection tasks in NLP and CV, GROD achieves SOTA regardless of data format. The code is available at https://anonymous.4open.science/r/GROD-OOD-Detection-with-transformers-B70F.

552Memory retaining finetuning via distillation

[openreview] [pdf]

Abstract Large language models (LLMs) pretrained on large corpora of internet text possess much of the world’s knowledge. Following pretraining, one often needs to conduct continued pretraining on certain capabilities such as math and coding, or “posttraining” (a.k.a., alignment) techniques to make the models follow users’ instructions and align them with human preferences. One challenge during these finetuning stages is that the model can lose the pretraining knowledge or forget certain capabilities (e.g., in-context learning ability). Moreover, although there exist strong open-weight LLMs such as Llama 3, both their pretraining and posttraining data are not open to the public, making it difficult to mix the finetuning data with the models’ own pretraining data as a solution for mitigating forgetting. We propose label annealing, a method that mitigates forgetting during finetuning without requiring access to the original pretraining data. Label annealing distills pretraining knowledge during finetuning by adding a KL divergence term to the loss function, regularizing the divergence between the finetuned model’s predictions and those of the initial pretrained model. In mathematics and code finetuning, label annealing improves the model’s performance in target domains without sacrificing other capabilities of the pretrained model. In alignment finetuning, our method introduces a smooth tradeoff between the instruction-following capability and the pretraining knowledge. We complement our empirical investigation with a mathematical model of overparameterized linear regression that provides geometric intuition for why label annealing helps.
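The loss itself is one line of algebra: task cross-entropy plus a KL term tying the finetuned model's predictions to the frozen pretrained model's. A minimal sketch, with the temperature and weighting as assumed hyperparameters:

```python
import torch.nn.functional as F

def label_annealing_loss(student_logits, frozen_logits, labels, lam=0.1, tau=1.0):
    """Cross-entropy on the task labels plus a KL regularizer toward the
    frozen pretrained model's predictions (frozen_logits carries no grad)."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.log_softmax(frozen_logits / tau, dim=-1),
                  log_target=True, reduction="batchmean") * tau ** 2
    return ce + lam * kl
```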

553Multi-aspect Knowledge Distillation with Large Language Model

[openreview] [pdf]

Abstract Recent advancements in deep learning have significantly improved performance on computer vision tasks. Previous image classification methods primarily modify model architectures or add features, and they optimize models using cross-entropy loss on class logits. Since they focus on classifying images based only on class labels, these methods may struggle to learn various aspects of classes (e.g., natural positions and shape changes). In contrast, humans classify images by naturally referring to multiple aspects such as context, shape, color, and other features. Inspired by this, rethinking the previous approach from a novel view, we propose a multi-aspect knowledge distillation method using Multimodal Large Language Models (MLLMs). Our approach involves: 1) querying a Large Language Model with multi-aspect questions relevant to the knowledge we want to transfer to the model, 2) extracting the corresponding logits from the MLLM, and 3) expanding the model’s output dimensions to distill these multi-aspect logits. We then apply cross-entropy loss to the class logits and binary cross-entropy loss to the multi-aspect logits. Through our method, the model can learn not only knowledge about visual aspects but also abstract and complex aspects that require deeper understanding. We primarily apply our method to image classification and, to explore the potential for extending our model, expand it to other tasks such as object detection. In all experiments, our method improves the performance of the baselines. Additionally, we analyze the effect of multi-aspect knowledge distillation. These results demonstrate that our method can transfer knowledge about various aspects to the model, and this aspect knowledge can enhance model performance in computer vision tasks. This paper demonstrates the great potential of multi-aspect knowledge distillation, and we believe it offers a promising direction for future research in computer vision and beyond.
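The combined objective reduces to two standard losses over a widened output head. A minimal sketch, assuming the MLLM's per-aspect scores are squashed to [0, 1] as soft targets:

```python
import torch
import torch.nn.functional as F

def multi_aspect_kd_loss(class_logits, aspect_logits, labels,
                         mllm_aspect_logits, lam=1.0):
    """Cross-entropy on class logits + BCE distilling multi-aspect logits.
    aspect_logits: (n, n_aspects) from the expanded student head;
    mllm_aspect_logits: teacher scores for the same aspects (assumption)."""
    ce = F.cross_entropy(class_logits, labels)
    bce = F.binary_cross_entropy_with_logits(
        aspect_logits, torch.sigmoid(mllm_aspect_logits))
    return ce + lam * bce
```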

554Rethinking the Bias of Foundation Model under Long-tailed Distribution

[openreview] [pdf]

Abstract Long-tailed learning has garnered increasing attention due to its practical significance. Among the various approaches, the fine-tuning paradigm has gained considerable interest with the advent of foundation models. However, most existing methods primarily focus on leveraging knowledge from these models, overlooking the inherent biases introduced by the imbalanced training data they rely on. In this paper, we examine how such imbalances affect long-tailed downstream tasks. Specifically, we refer to the biases in foundation models and downstream tasks as parameter imbalance and data imbalance, respectively. Through fine-tuning, we observe that parameter imbalance plays a more critical role, while data imbalance can be mitigated using existing re-balancing strategies. Moreover, we find that parameter imbalance cannot be effectively addressed by current re-balancing techniques, such as adjusting the logits during training, unlike data imbalance. To tackle both imbalances simultaneously, we construct a causal structure graph and view the partial semantic factor as a confounder, which introduces spurious correlations between input samples and labels. To resolve these negative effects, we propose a novel backdoor adjustment method that learns the true causal effect between input samples and labels, rather than merely fitting the correlations in the data. Experimental results validate the effectiveness of our method.

555100 instances is all you need: predicting LLM success by testing on a few instances

[openreview] [pdf]

Abstract Predicting if LLMs will succeed on individual task instances is essential to ensure their reliability in high-stakes applications. To do so, we can evaluate an LLM on a set of instances and train an “assessor” to predict its performance. However, this requires evaluating each new LLM on sufficiently many instances. In this work, we build a “generic assessor” predicting the performance of any LLM on an instance by using the LLM’s performance on a small set of reference instances and the features of the considered instance. In practice, we make use of existing evaluation results to extract the representative instances and train the assessor. Thus, the performance of a new LLM can be predicted by only testing it on the reference instances, leveraging the information contained in other LLMs’ evaluations. We conduct empirical studies on HELM-Lite and KindsOfReasoning, a new collection of existing reasoning datasets that we introduce, where we evaluate all instruction-fine-tuned OpenAI models up to gpt4-0125-preview. We find that a few instances (around 100) are enough to achieve predictive power comparable to the LLM-specific assessors trained on the complete set of several thousand instances. Interestingly, randomly selecting the reference instances performs comparably to the advanced selection methods we tested. Finally, we identify a sharp drop in the predictive power of the generic and specific assessors in out-of-distribution scenarios, suggesting that the inherent predictability of LLMs is low.
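The generic assessor amounts to a supervised model whose input concatenates an LLM "fingerprint" (its 0/1 outcomes on the ~100 reference instances) with features of the instance being predicted. A rough scikit-learn sketch under those assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_generic_assessor(ref_success, inst_feats, success):
    """ref_success: (n_llms, n_ref) outcomes on reference instances;
    inst_feats: (n_inst, d) instance features; success: (n_llms, n_inst)
    binary outcomes. Rows pair each LLM fingerprint with each instance."""
    n_llms, n_inst = success.shape
    X = np.concatenate([np.repeat(ref_success, n_inst, axis=0),    # LLM fingerprint
                        np.tile(inst_feats, (n_llms, 1))], axis=1)  # instance features
    y = success.reshape(-1)
    return LogisticRegression(max_iter=1000).fit(X, y)

# Predicting for a new LLM then only requires running it on the reference
# instances and concatenating that fingerprint with each instance's features.
```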

556Aggregation of Multi Diffusion Models for Enhancing Learned Representations

[openreview] [pdf]

Abstract Diffusion models have achieved remarkable success in image generation, particularly with the various applications of classifier-free guidance conditional diffusion models. While many diffusion models perform well when controlling for a particular aspect among style, character, and interaction, they struggle with fine-grained control due to dataset limitations and intricate model architecture design. This paper introduces a novel algorithm, Aggregation of Multi Diffusion Models (AMDM), which synthesizes features from multiple diffusion models into a specified model, enhancing its learned representations to activate specific features for fine-grained control. AMDM consists of two key components: spherical aggregation and manifold optimization. Spherical aggregation merges intermediate variables from different diffusion models with minimal manifold deviation, while manifold optimization refines these variables to align with the intermediate data manifold, enhancing sampling quality. Experimental results demonstrate that AMDM significantly improves fine-grained control without additional training or inference time, proving its effectiveness. Additionally, it reveals that diffusion models initially focus on features such as position, attributes, and style, with later stages improving generation quality and consistency. AMDM offers a new perspective for tackling the challenges of fine-grained conditional control generation in diffusion models: we can fully utilize existing conditional diffusion models that control specific aspects, or develop new ones, and then aggregate them using the AMDM algorithm. This eliminates the need for constructing complex datasets, designing intricate model architectures, and incurring high training costs. Code is available at: https://github.com/Hammour-steak/AMDM
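Spherical aggregation of intermediate diffusion variables is plausibly a slerp-style merge; the exact rule in the paper may differ, and the manifold-optimization step is omitted here. A minimal sketch:

```python
import torch

def spherical_aggregate(x_a, x_b, w=0.5, eps=1e-8):
    """Slerp-style merge of intermediate latents from two diffusion models,
    staying closer to their shared (roughly spherical) manifold than a
    linear average would."""
    a, b = x_a.flatten(), x_b.flatten()
    cos = (a @ b) / (a.norm() * b.norm() + eps)
    theta = torch.arccos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    merged = (torch.sin((1 - w) * theta) * a + torch.sin(w * theta) * b) \
             / (torch.sin(theta) + eps)
    return merged.view_as(x_a)
```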

557OASIS Uncovers: High-Quality T2I Models, Same Old Stereotypes

[openreview] [pdf]

Abstract Images generated by text-to-image (T2I) models often exhibit visual biases and stereotypes of concepts such as culture and profession. Existing quantitative measures of stereotypes are based on statistical parity that does not align with the sociological definition of stereotypes and, therefore, incorrectly categorizes biases as stereotypes. Instead of oversimplifying stereotypes as biases, we propose a quantitative measure of stereotypes that aligns with its sociological definition. We then propose OASIS to measure the stereotypes in a generated dataset and understand their origins within the T2I model. OASIS includes two scores to measure stereotypes from a generated image dataset: (M1) Stereotype Score to measure the distributional violation of stereotypical attributes, and (M2) WALS to measure spectral variance in the images along a stereotypical attribute. OASIS also includes two methods to understand the origins of stereotypes in T2I models: (U1) StOP to discover attributes that the T2I model internally associates with a given concept, and (U2) SPI to quantify the emergence of stereotypical attributes in the latent space of the T2I model during image generation. Despite the considerable progress in image fidelity, using OASIS, we conclude that newer T2I models such as FLUX.1 and SDv3 contain strong stereotypical predispositions about concepts and still generate images with widespread stereotypical attributes. Additionally, the quantity of stereotypes worsens for nationalities with lower Internet footprints.

558Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning

[openreview] [pdf]

Abstract In this work, we address the problem of large language model (LLM) unlearning, aiming to remove unwanted data influences and associated model capabilities (e.g., copyrighted data or harmful content generation) while preserving essential model utilities, without the need for retraining from scratch. Despite the growing need for LLM unlearning, a principled optimization framework remains lacking. To this end, we revisit the state-of-the-art approach, negative preference optimization (NPO), and identify the issue of reference model bias, which could undermine NPO’s effectiveness, particularly when unlearning forget data of varying difficulty. Given that, we propose a simple yet effective unlearning optimization framework, called SimNPO, showing that "simplicity" in removing the reliance on a reference model (through the lens of simple preference optimization) benefits unlearning. We also provide deeper insights into SimNPO’s advantages, supported by analysis using mixtures of Markov chains. Furthermore, we present extensive experiments validating SimNPO’s superiority over existing unlearning baselines in benchmarks like TOFU and MUSE, and robustness against relearning attacks.
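Reading the abstract through the lens of simple preference optimization suggests a reference-free loss of roughly the following shape; this is an assumption about the formula, not the paper's verbatim objective. The reward is the length-normalized log-likelihood of the forget sample, pushed down through a negative log-sigmoid:

```python
import torch.nn.functional as F

def simnpo_style_loss(logprob_sum, length, beta=2.5, gamma=0.0):
    """Hedged sketch of a reference-free NPO-style unlearning loss.
    logprob_sum: summed token log-probs of the forget response y under the
    current model; length: |y|. No reference model appears anywhere."""
    reward = (beta / length) * logprob_sum - gamma   # length-normalized reward
    return -(2.0 / beta) * F.logsigmoid(-reward).mean()
```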

559Enforcing Interpretability in Time Series Transformers: A Concept Bottleneck Framework

[openreview] [pdf]

Abstract There has been a recent push of research on Transformer-based models for long-term time series forecasting, even though they are inherently difficult to interpret and explain. While there is a large body of work on interpretability methods for various domains and architectures, the interpretability of Transformer-based forecasting models remains largely unexplored. To address this gap, we develop a framework based on Concept Bottleneck Models to enforce interpretability of time series Transformers. We modify the training objective to encourage a model to develop representations similar to predefined interpretable concepts. In our experiments, we enforce similarity using Centered Kernel Alignment, and the predefined concepts include time features and an interpretable, autoregressive surrogate model (AR). We apply the framework to the Autoformer model, and present an in-depth analysis for a variety of benchmark tasks. We find that the model performance remains mostly unaffected, while the model shows much improved interpretability. Additionally, interpretable concepts become local, which makes the trained model easily intervenable. As a proof of concept, we demonstrate a successful intervention in the scenario of a time shift in the data, which eliminates the need to retrain.
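The similarity measure named in the abstract, Centered Kernel Alignment, has a standard linear form that is easy to reproduce; an auxiliary term maximizing it between a chosen layer's activations and concept features (time features, AR surrogate predictions) is one plausible way to realize the modified training objective.

```python
import torch

def linear_cka(X, Y):
    """Linear CKA between representation matrices X: (n, d1) and Y: (n, d2),
    e.g., Transformer activations vs. interpretable concept features."""
    X = X - X.mean(dim=0, keepdim=True)   # center each feature column
    Y = Y - Y.mean(dim=0, keepdim=True)
    num = (X.t() @ Y).norm(p="fro") ** 2
    den = (X.t() @ X).norm(p="fro") * (Y.t() @ Y).norm(p="fro")
    return num / (den + 1e-12)
```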

560KnobGen: Controlling the Sophistication of Artwork in Sketch-Based Diffusion Models

[openreview] [pdf]

Abstract Recent advances in diffusion models have significantly improved text-to-image (T2I) generation, but they often struggle to balance fine-grained precision with high-level control. Methods like ControlNet and T2I-Adapter excel at following sketches by seasoned artists but tend to be overly rigid, replicating unintentional flaws in sketches from novice users. Meanwhile, coarse-grained methods, such as sketch-based abstraction frameworks, offer more accessible input handling but lack the precise control needed for detailed, professional use. To address these limitations, we propose KnobGen, a dual-pathway framework that democratizes sketch-based image generation by seamlessly adapting to varying levels of sketch complexity and user skill. KnobGen uses a Coarse-Grained Controller (CGC) module for high-level semantics and a Fine-Grained Controller (FGC) module for detailed refinement. The relative strength of these two modules can be adjusted through our knob inference mechanism to align with the user’s specific needs. These mechanisms ensure that KnobGen can flexibly generate images from both novice sketches and those drawn by seasoned artists. This maintains control over the final output while preserving the natural appearance of the image, as evidenced on the MultiGen-20M dataset and a newly collected sketch dataset.

561Bias Mitigation in Graph Diffusion Models

[openreview] [pdf]

Abstract Most existing graph generative diffusion models suffer from significant exposure bias during graph sampling. We observe that the forward diffusion’s maximum perturbation distribution in most models deviates from the standard normal distribution, while reverse sampling consistently starts from a standard normal distribution. This mismatch results in a reverse starting bias, which, together with the exposure bias, degrades generation quality. The exposure bias typically accumulates and propagates throughout the sampling process. In this paper, we effectively address both biases. To mitigate reverse starting bias, we employ a newly designed Langevin sampling algorithm to align with the forward maximum perturbation distribution, establishing a new reverse starting point. To address the exposure bias, we introduce a fraction correction mechanism based on a newly defined score difference. Our approach, which requires no network modifications, is validated across multiple models, datasets, and tasks, achieving state-of-the-art results.

562Forward Learning with Differential Privacy

[openreview] [pdf]

Abstract Differential privacy (DP) in deep learning is a critical concern as it ensures the confidentiality of training data while maintaining model utility. Existing DP training algorithms provide privacy guarantees by clipping each individual backpropagated gradient and then injecting noise. Different from backpropagation, forward-learning algorithms based on perturbation inherently utilize randomness to estimate the gradient of each sample in parallel. These algorithms offer high parallelizability, suitability for non-differentiable modules, and applicability in black-box settings. Moreover, the introduction of noise during the forward pass indirectly provides randomness protection to the model parameters and their gradients, suggesting its potential for naturally providing differential privacy. In this paper, we propose a forward-learning algorithm, Differential Private Unified Likelihood Ratio method (DP-ULR), and demonstrate its differential privacy guarantees. DP-ULR features a novel batch sampling operation with rejection, which we theoretically analyze in conjunction with classic differential privacy mechanisms. DP-ULR is also underpinned by a theoretically guided privacy controller that dynamically adjusts noise levels to manage privacy costs effectively in each training step. Our experiments indicate that DP-ULR achieves competitive performance compared to traditional differential privacy training algorithms based on backpropagation, maintaining the same privacy loss limits.

563Bridging Jensen Gap for Max-Min Group Fairness Optimization in Recommendation

[openreview] [pdf]

Abstract Group max-min fairness (MMF) is commonly used in fairness-aware recommender systems (RS) as an optimization objective, as it aims to protect marginalized item groups and ensures a fair competition platform. However, our theoretical analysis indicates that integrating the MMF constraint violates the assumption of sample independence during optimization, causing the loss function to deviate from linear additivity. This nonlinearity introduces a Jensen gap between the model’s convergence point and the optimal point when mini-batch sampling is applied. Both theoretical and empirical studies show that as the mini-batch size decreases and the group size increases, the Jensen gap widens accordingly. Some methods using heuristic re-weighting or debiasing strategies have the potential to bridge the Jensen gap. However, they either lack theoretical guarantees or suffer from heavy computational costs. To overcome these limitations, we first theoretically demonstrate that the MMF-constrained objective can be essentially reformulated as a group-weighted optimization objective. Then we present an efficient and effective algorithm named FairDual, which utilizes a dual optimization technique to minimize the Jensen gap. Our theoretical analysis demonstrates that FairDual can achieve a sub-linear convergence rate to the globally optimal solution and that the Jensen gap can be well bounded under a mini-batch sampling strategy with random shuffling. Extensive experiments conducted using three large-scale RS backbone models on two publicly available datasets demonstrate that FairDual outperforms all baselines in terms of both accuracy and fairness.

564Log-Sum-Exponential Estimator for Off-Policy Evaluation and Learning

[openreview] [pdf]

Abstract Off-policy learning and evaluation scenarios leverage logged bandit feedback datasets, which contain context, action, propensity score, and feedback for each data point. These scenarios face significant challenges due to high variance and poor performance with low-quality propensity scores and heavy-tailed reward distributions. We address these issues by introducing a novel estimator based on the log-sum-exponential (LSE) operator, which outperforms traditional inverse propensity score estimators. Our LSE estimator demonstrates variance reduction and robustness under heavy-tailed conditions. For off-policy evaluation, we derive upper bounds on the estimator’s bias and variance. In the off-policy learning scenario, we establish bounds on the regret—the performance gap between our LSE estimator and the optimal policy—assuming a bounded (1+ε)-th moment of the weighted reward. Notably, we achieve a convergence rate of O(n^{-ε/(1+ε)}) for the regret bounds, where n is the number of training samples. Theoretical analysis is complemented by comprehensive empirical evaluations in both off-policy learning and evaluation scenarios, confirming the practical advantages of our approach.
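The LSE operator itself is a one-liner. Below is a sketch of the estimator as the abstract describes it, with a numerically stabilized log-sum-exp; λ < 0 damps heavy-tailed weighted rewards (trading a small bias for lower variance), and λ → 0 recovers the plain inverse-propensity-score average. The exact parameterization is an assumption.

```python
import numpy as np

def lse_estimator(weights, rewards, lam=-1.0):
    """LSE off-policy value estimate: (1/lam) * log mean exp(lam * w * r).
    weights: importance weights pi_target/pi_logging; rewards: logged rewards."""
    z = lam * weights * rewards
    m = z.max()                                 # stabilize the log-sum-exp
    return (m + np.log(np.mean(np.exp(z - m)))) / lam

# Example: with heavy-tailed weights, the plain IPS average
# (weights * rewards).mean() can blow up, while the LSE estimate stays stable.
```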

565Longhorn: State Space Models are Amortized Online Learners

[openreview] [pdf]

Abstract The most fundamental capability of modern AI methods such as Large Language Models (LLMs) is the ability to predict the next token in a long sequence of tokens, known as “sequence modeling.” Although the Transformer is the current dominant approach to sequence modeling, its quadratic computational cost with respect to sequence length is a significant drawback. State-space models (SSMs) offer a promising alternative due to their linear decoding efficiency and high parallelizability during training. However, existing SSMs often rely on seemingly ad hoc linear recurrence designs. In this work, we explore SSM design through the lens of online learning, conceptualizing SSMs as meta-modules for specific online learning problems. This approach links SSM design to formulating precise online learning objectives, with state transition rules derived from optimizing these objectives. Based on this insight, we introduce a novel deep SSM architecture based on the implicit update for optimizing an online regression objective. Our experimental results show that our models outperform state-of-the-art SSMs, including the Mamba model, on standard sequence modeling benchmarks and language modeling tasks.

566Following the Human Thread in Social Navigation

[openreview] [pdf]

Abstract The success of collaboration between humans and robots in shared environments relies on the robot’s real-time adaptation to human motion. Specifically, in Social Navigation, the agent should be close enough to assist but ready to back up to let the human move freely, avoiding collisions. Human trajectories emerge as crucial cues in Social Navigation, but they are partially observable from the robot’s egocentric view and computationally complex to process. We present the first Social Dynamics Adaptation model (SDA) based on the robot’s state-action history to infer the social dynamics. We propose a two-stage Reinforcement Learning framework: the first stage learns to encode the human trajectories into social dynamics and learns a motion policy conditioned on this encoded information, the current status, and the previous action. Here, the trajectories are fully visible, i.e., assumed as privileged information. In the second stage, the trained policy operates without direct access to trajectories. Instead, the model infers the social dynamics solely from the history of previous actions and statuses in real time. Tested on the novel Habitat 3.0 platform, SDA sets a novel state-of-the-art (SotA) performance in finding and following humans. The code will be released upon acceptance.

567Open-World Reinforcement Learning over Long Short-Term Imagination

[openreview] [pdf]

Abstract Training visual reinforcement learning agents in a high-dimensional open world presents significant challenges. While various model-based methods have improved sample efficiency by learning interactive world models, these agents tend to be “short-sighted”, as they are typically trained on short snippets of imagined experiences. We argue that the primary obstacle in open-world decision-making is improving the efficiency of off-policy exploration across an extensive state space. In this paper, we present LS-Imagine, which extends the imagination horizon within a limited number of state transition steps, enabling the agent to explore behaviors that potentially lead to promising long-term feedback. The foundation of our approach is to build a long short-term world model. To achieve this, we simulate goal-conditioned jumpy state transitions and compute corresponding affordance maps by zooming in on specific areas within single images. This facilitates the integration of direct long-term values into behavior learning. Our method demonstrates significant improvements over state-of-the-art techniques in MineDojo.

568Inverse Constitutional AI: Compressing Preferences into Principles

[openreview] [pdf]

Abstract Feedback data is widely used to align or evaluate state-of-the-art AI models according to human preferences. Pairwise text preferences, where human (or AI) annotators select the “better” of two options, are particularly common. This data is typically used to train reward models or to compute aggregate statistics, asserting one model to be “better” than another. For many applications, however, it is desirable to understand human preferences in addition to modeling them. Neither black-box reward models nor statistics can answer why one model is better than another. Pairwise preference datasets, therefore, pose an interpretability challenge. The raw data consists of numerous (long) response pairs that are often infeasible to interpret manually. Prior work has demonstrated that human-annotated preference data often exhibits unintended biases, underscoring the urgent need for good interpretability tools to detect and alleviate such biases. In this paper, we introduce the Inverse Constitutional AI (ICAI) problem, formulating the interpretation of pairwise text preference data as a compression task. In constitutional AI, a set of principles (a constitution) is used to provide feedback and fine-tune AI models. ICAI inverts this process: given a feedback dataset, we aim to extract a constitution that best enables a large language model (LLM) to reconstruct the original annotations. We propose a corresponding algorithm and validate its generated constitutions quantitatively based on annotation reconstruction accuracy on a variety of datasets: (a) synthetic feedback data with known underlying principles; (b) AlpacaEval data with cross-annotated human feedback; (c) crowdsourced Chatbot Arena data; and (d) PRISM data from diverse demographic groups. As a short and interpretable representation of the original dataset, generated constitutions have many potential use cases — they may help identify undesirable annotator biases, better understand model performance, scale feedback to unseen data, or assist with adapting LLMs to individual user or group preferences. We release the code for our experiments at hidden url.

569FullDiffusion: Diffusion Models Without Time Truncation

[openreview] [pdf]

Abstract Diffusion models are predominantly used for generative modeling; they synthesize samples by simulating the reverse process of a stochastic differential equation (SDE) that diffuses data into Gaussian noise. However, when simulating the reverse SDE, the SDE solver suffers from numerical instability near the time boundary; hence, in practice, the simulation is terminated before reaching the boundary point. This heuristic time truncation hinders the rigorous formulation of diffusion models and requires additional hyperparameter tuning. Moreover, such numerical instability often occurs even in training, especially when using a maximum likelihood loss. Therefore, current diffusion models rely heavily on the time truncation technique in both training and inference. In this paper, we propose a method that completely eliminates the heuristic of time truncation. Our method eliminates numerical instability during maximum likelihood training by modifying the parameterization of the noise predictor and the noise schedule. We also propose a novel SDE solver that can simulate without time truncation by taking advantage of the semi-linear structure of the reverse SDE. These improvements enable stable training and sampling of diffusion models without relying on time truncation. In our experiments, we tested the effectiveness of our method on the CIFAR-10 and ImageNet-32 datasets by evaluating the test likelihood and the sample quality measured by the Fréchet inception distance (FID). We observe that our method consistently improves both test likelihood and FID compared to the DDPM++ baseline.

570Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning

[openreview] [pdf]

Abstract Diffusion Policies have become widely used in Imitation Learning, offering several appealing properties, such as generating multimodal and discontinuous behavior. As models are becoming larger to capture more complex capabilities, their computational demands increase, as shown by recent scaling laws. Therefore, continuing with the current architectures will present a computational roadblock. To address this gap, we propose Mixture-of-Denoising Experts (MoDE) as a novel policy for Imitation Learning. MoDE surpasses current state-of-the-art Transformer-based Diffusion Policies while enabling parameter-efficient scaling, reducing the inference cost significantly. To achieve this, MoDE uses sparse experts combined with a novel routing strategy that conditions the expert selection on the current noise level of the denoising process. This is combined with a noise-conditioned self-attention mechanism for further improvements. MoDE achieves state-of-the-art performance across 134 tasks in four established imitation learning benchmarks (CALVIN and LIBERO). It surpasses both CNN-based and Transformer Diffusion Policies by an average of 20% in all settings, while using 40% fewer FLOPs and fewer active parameters. Furthermore, we conduct comprehensive ablations on MoDE’s components, providing insights for designing efficient and scalable Transformer architectures for Diffusion Policies.
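
As a sketch of the routing idea (an illustrative design, not MoDE's exact architecture), a router that sees only a noise-level embedding can pick a sparse subset of expert MLPs per denoising step; because the route depends only on the noise level, the expert assignment for each step can in principle be precomputed, which is one way such a design keeps inference cheap.

```python
import torch
import torch.nn as nn

class NoiseConditionedMoE(nn.Module):
    """Sparse MoE block whose router is conditioned on a noise-level
    embedding rather than on the token content. Illustrative sketch only."""
    def __init__(self, dim, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)  # input: noise-level embedding
        self.top_k = top_k

    def forward(self, x, noise_emb):
        # route per sample using only the noise level of the denoising step
        weights, idx = torch.topk(self.router(noise_emb).softmax(-1), self.top_k, -1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    w = weights[mask][:, slot].unsqueeze(-1)
                    out[mask] = out[mask] + w * expert(x[mask])
        return out
```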

571Rainbow Generator: Generating Diverse Data for Name Only Continual Learning

[openreview] [pdf]

Abstract Requiring extensive human supervision is often impractical for continual learning due to its cost, leading to the emergence of ‘name-only continual learning’, which provides only the name of new concepts (e.g., classes) without supervised samples. To address this task, a recent approach uses web-scraped data, but this introduces issues such as data imbalance, copyright, and privacy concerns. To overcome the limitations of both human supervision and webly supervision, we propose Generative name-only Continual Learning (GenCL), which uses generative models for name-only continual learning. However, naïve application of generative models yields limited diversity in the generated data. We therefore propose a diverse prompt generation method, HIerarchical Recurrent Prompt Generation (HIRPG), as well as a COmplexity-NAvigating eNsembler (CONAN) that selects samples with minimal overlap from multiple generative models. We empirically validate that the proposed GenCL outperforms prior art, and even a model trained with fully supervised data, on various tasks including image recognition and multi-modal visual reasoning. Data generated by GenCL is available at https://anonymous.4open.science/r/name-only-continual-E079.

572Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking

[openreview] [pdf]

Abstract Aligning AI systems with human preferences typically suffers from the infamous reward hacking problem, where optimization of an imperfect reward model leads to undesired behaviors. In this paper, we investigate reward hacking in offline preference optimization (PO), which aims to improve an initial model using a preference dataset. We identify two types of reward hacking stemming from statistical fluctuations in the dataset: Type I Reward Hacking, due to subpar choices appearing more favorable, and Type II Reward Hacking, due to decent choices appearing less desirable. We prove that many (mainstream or theoretical) PO methods suffer from both types of reward hacking. To address Type I Reward Hacking, we propose POWER, a new PO method that combines Guiasu’s Weighted Entropy with a Robust Reward maximization objective. POWER enjoys finite-sample guarantees under general function approximation, competing with the best covered policy in the data. To address Type II Reward Hacking, we analyze the learning dynamics of POWER and combine it with a novel technique that dynamically updates preference labels (POWER-DL) toward certain “stationary labels”, resulting in diminishing gradients for untrustworthy samples. Empirically, POWER-DL consistently outperforms state-of-the-art methods on alignment benchmarks, achieving improvements of up to 13.0 points on AlpacaEval 2 and 11.5 points on Arena Hard over DPO. Strong theoretical guarantees and empirical performance demonstrate the promise of POWER-DL in mitigating reward hacking.

573Federated Learning Can Find Friends That Are Advantageous

[openreview] [pdf]

Abstract In Federated Learning (FL), the distributed nature and heterogeneity of client data present both opportunities and challenges. While collaboration among clients can significantly enhance the learning process, not all collaborations are beneficial; some may even be detrimental. In this study, we introduce a novel algorithm that assigns adaptive aggregation weights to clients participating in FL training, identifying those with data distributions most conducive to a specific learning objective. We demonstrate that our aggregation method converges no worse than the method that aggregates only the updates received from clients with the same data distribution. Furthermore, empirical evaluations consistently reveal that collaborations guided by our algorithm outperform traditional FL approaches. This underscores the critical role of judicious client selection and lays the foundation for more streamlined and effective FL implementations in the coming years.

574STRAP: Robot Sub-Trajectory Retrieval for Augmented Policy Learning

[openreview] [pdf]

Abstract Robot learning is witnessing a significant increase in the size, diversity, and complexity of pre-collected datasets, mirroring trends in domains such as natural language processing and computer vision. Many robot learning methods treat such datasets as multi-task expert data and learn a multi-task, generalist policy by training broadly across them. Notably, while these generalist policies can improve the average performance across many tasks, the performance of generalist policies on any one task is often suboptimal due to negative transfer between partitions of the data, compared to task-specific specialist policies. In this work, we argue for the paradigm of training policies during deployment given the scenarios they encounter: rather than deploying pre-trained policies to unseen problems in a zero-shot manner, we non-parametrically retrieve and train models directly on relevant data at test time. Furthermore, we show that many robotics tasks share considerable amounts of low-level behaviors and that retrieval at the “sub”-trajectory granularity enables significantly improved data utilization, generalization, and robustness in adapting policies to novel problems. In contrast, existing full-trajectory retrieval methods tend to underutilize the data and miss out on shared cross-task content. This work proposes STRAP, a technique for leveraging pre-trained vision foundation models and dynamic time warping to retrieve sub-sequences of trajectories from large training corpora in a robust fashion. STRAP outperforms both prior retrieval algorithms and multi-task learning methods in simulated and real experiments, showing the ability to scale to much larger offline datasets in the real world as well as the ability to learn robust control policies with just a handful of real-world demonstrations.
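
To make the retrieval step concrete, below is a minimal subsequence-DTW scorer over feature sequences (e.g., frozen vision-foundation-model embeddings); STRAP's own formulation, feature choice, and any acceleration are as described in the paper, so treat this as an illustrative baseline.

```python
import numpy as np

def subsequence_dtw_cost(query, trajectory):
    """Cost of the best DTW alignment of `query` (m x d) against any
    sub-sequence of `trajectory` (n x d). Free start and end points along
    the trajectory axis are what make this *sub*-trajectory retrieval."""
    m, n = len(query), len(trajectory)
    dist = np.linalg.norm(query[:, None, :] - trajectory[None, :, :], axis=-1)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, :] = 0.0                                   # free start on trajectory
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m].min()                               # free end on trajectory

# Retrieval: rank the corpus by this cost and train on the best sub-trajectories.
```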

575Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

[openreview] [pdf]

Abstract Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel online algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is directly minimized over a preference dataset. We provide theoretical analysis for our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard, showing substantial improvement over the state-of-the-art online RLHF algorithms.

576Towards Machine Theory of Mind with Large Language Model-Augmented Inverse Planning

[openreview] [pdf]

Abstract We propose a hybrid approach to machine Theory of Mind (ToM) that uses large language models (LLMs) as a mechanism for generating hypotheses and likelihood functions with a Bayesian inverse planning model that computes posterior probabilities for an agent’s likely mental states given its actions. Bayesian inverse planning models can accurately predict human reasoning on a variety of ToM tasks, but these models are constrained in their ability to scale these predictions to scenarios with a large number of possible hypotheses and actions. Conversely, LLM-based approaches have recently demonstrated promise in solving ToM benchmarks, but can exhibit brittleness and failures on reasoning tasks even when they pass otherwise structurally identical versions. By combining these two methods, our approach leverages the strengths of each component, closely matching optimal results on a task inspired by prior inverse planning models and improving performance relative to models that utilize LLMs alone or with chain-of-thought prompting. We also exhibit the model’s potential to predict mental states on open-ended tasks, offering a promising direction for future development of ToM models and the creation of socially intelligent generative agent models.
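
The Bayesian half of the hybrid is ordinary inverse planning; here is a minimal sketch in which the LLM is assumed to supply both the hypothesis set and the likelihood function (`likelihood_fn` is a stand-in name):

```python
import numpy as np

def inverse_planning_posterior(actions, hypotheses, prior, likelihood_fn):
    """Bayesian inverse planning: P(goal | actions) ∝ P(actions | goal) P(goal).
    `likelihood_fn(action, h)` returns P(action | hypothesis h), e.g. as
    elicited from an LLM; the Bayesian update itself stays exact."""
    log_post = np.log(np.asarray(prior, dtype=float))
    for a in actions:                    # accumulate evidence sequentially
        log_post += np.log([likelihood_fn(a, h) for h in hypotheses])
    log_post -= log_post.max()           # normalize in log space for stability
    post = np.exp(log_post)
    return post / post.sum()
```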

577In Search of Forgotten Domain Generalization

[openreview] [pdf]

Abstract Out-of-Domain (OOD) generalization is the ability of a model trained on one or more domains to generalize to unseen domains. In the ImageNet era of computer vision, evaluation sets for measuring a model’s OOD performance were designed to be strictly OOD with respect to style. However, the emergence of foundation models and expansive web-scale datasets has obfuscated this evaluation process, as datasets cover a broad range of domains and risk test domain contamination. In search of the forgotten domain generalization, we create large-scale datasets subsampled from LAION—LAION-Natural and LAION-Rendition—that are strictly OOD to corresponding ImageNet and DomainNet test sets in terms of style. Training CLIP models on these datasets reveals that a significant portion of their performance is explained by in-domain examples. This indicates that the OOD generalization challenges from the ImageNet era still prevail and that training on web-scale data merely creates the illusion of OOD generalization. Furthermore, through a systematic exploration of combining natural and rendition datasets in varying proportions, we identify optimal mixing ratios for model generalization across these domains. Our datasets and results re-enable meaningful assessment of OOD robustness at scale—a crucial prerequisite for improving model robustness.

578Test-Time Training for Out-of-Distribution Industrial Anomaly Detection via Robust Distribution Alignment

[openreview] [pdf]

Abstract Detecting anomalous patterns is essential for quality control in industrial applications, with state-of-the-art methods relying on large defect-free datasets to model normal distributions. However, robustness under domain shift, such as changes in lighting or sensor drift, remains a critical challenge in real-world deployment. An existing work, Generalized Normality Learning (GNL), addresses domain shifts by enforcing feature consistency through training-time augmentation, but its reliance on prior knowledge of target distributions and access to training data at inference limits flexibility. To overcome these limitations, we propose a memory bank-based anomaly detection method that avoids retraining or access to training data during inference. We improve robustness to distribution shifts via distribution-alignment-based test-time training. Our approach leverages a modified Sinkhorn distance to align distributions and handle outliers, offering a more resilient solution for industrial anomaly detection under realistic constraints. Extensive evaluations on out-of-distribution anomaly detection benchmarks demonstrate the effectiveness of our approach.
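
For reference, these are the standard balanced Sinkhorn iterations that the paper's modified, outlier-robust variant builds on (names illustrative):

```python
import numpy as np

def sinkhorn_plan(a, b, cost, eps=0.1, n_iters=200):
    """Entropic-OT transport plan between histograms a (e.g., memory-bank
    features) and b (test-time features) with cost matrix `cost`.
    Standard balanced Sinkhorn; the paper's robust modification differs."""
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # alternate scaling updates
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # plan; alignment cost = (plan * cost).sum()
```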

579A Theoretical Perspective: When and How Self-consuming Training Loops Generalize

[openreview] [pdf]

Abstract High-quality data is essential for training large generative models, yet the vast reservoir of real data available online has become nearly depleted. Consequently, models increasingly generate their own data for further training, forming Self-consuming Training Loops (STLs). However, the empirical results have been strikingly inconsistent: some models degrade or even collapse, while others successfully avoid these failures, leaving a significant gap in theoretical understanding to explain this discrepancy. This paper introduces the intriguing notion of recursive stability and presents the first theoretical generalization analysis, revealing how both model architecture and the proportion between real and synthetic data influence the success of STLs. We further extend this analysis to transformers in in-context learning, showing that even a constant-sized proportion of real data ensures convergence, while also providing insights into optimal synthetic data sizing.

580Rethinking Diffusion Posterior Sampling: From Conditional Score Estimator to Maximizing a Posterior

[openreview] [pdf]

Abstract Recent advancements in diffusion models have been leveraged to address inverse problems without additional training, and Diffusion Posterior Sampling (DPS) (Chung et al., 2022a) is among the most popular approaches. Previous analyses suggest that DPS accomplishes posterior sampling by approximating the conditional score. In this paper, however, we demonstrate that the conditional score approximation employed by DPS is not as effective as previously assumed; rather, it aligns more closely with the principle of maximizing a posterior (MAP). This assertion is substantiated through an examination of DPS on 512×512 ImageNet images, revealing that: 1) DPS’s conditional score estimation significantly diverges from the score of a well-trained conditional diffusion model and is even inferior to the unconditional score; 2) The mean of DPS’s conditional score estimation deviates significantly from zero, rendering it an invalid score estimation; 3) DPS generates high-quality samples with significantly lower diversity. In light of the above findings, we posit that DPS more closely resembles MAP than a conditional score estimator, and accordingly propose the following enhancements to DPS: 1) we explicitly maximize the posterior through multi-step gradient ascent and projection; 2) we utilize a lightweight conditional score estimator trained with only 100 images and 8 GPU hours. Extensive experimental results indicate that these proposed improvements significantly enhance DPS’s performance. The source code for these improvements is provided in the supplementary material.

581Provably Efficient Multi-Objective Bandit Algorithms under Preference-Centric Customization

[openreview] [pdf]

Abstract Existing multi-objective multi-armed bandit (MO-MAB) approaches mainly focus on achieving Pareto optimality. However, a Pareto optimal arm that receives a high score from one user may lead to a low score from another, since in real-world scenarios, users often have diverse preferences across different objectives. Instead, these preferences should inform customized learning, a factor usually neglected in prior research. To address this need, we study a preference-aware MO-MAB framework in the presence of explicit user preferences, where each user’s overall reward is modeled as the inner product of the user preference and the arm reward. This new framework shifts the focus from merely achieving Pareto optimality to further optimizing within the Pareto front under preference-centric customization. To the best of our knowledge, this is the first theoretical exploration of customized MO-MAB optimization based on explicit user preferences. This framework introduces new and unique challenges for algorithm design for customized optimization. To address these challenges, we incorporate preference estimation and preference-aware optimization as key mechanisms for preference adaptation, and develop new analytical techniques to rigorously account for the impact of preference estimation errors on overall performance. Under this framework, we consider three preference structures inspired by practical applications, with tailored algorithms that are proven to achieve near-optimal regret, and show good numerical performance.
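
To illustrate the overall-reward model, here is a one-round sketch of a UCB-style rule that scalarizes per-objective empirical means with the user's preference vector; this is our own illustration, not one of the paper's three tailored algorithms:

```python
import numpy as np

def preference_ucb_round(means, counts, t, preference, c=2.0):
    """Pick an arm by preference-weighted value plus exploration bonus.

    means:      (n_arms, n_objectives) empirical per-objective rewards
    counts:     (n_arms,) pull counts so far
    preference: (n_objectives,) known user preference vector
    """
    scalarized = means @ preference                       # inner-product reward
    bonus = np.sqrt(c * np.log(t + 1) / np.maximum(counts, 1))
    return int(np.argmax(scalarized + bonus))
```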

582Statistical Tractability of Off-policy Evaluation of History-dependent Policies in POMDPs

[openreview] [pdf]

Abstract We investigate off-policy evaluation (OPE), a central and fundamental problem in reinforcement learning (RL), in the challenging setting of Partially Observable Markov Decision Processes (POMDPs) with large observation spaces. Recent works of Uehara et al. (2023a); Zhang & Jiang (2024) developed a model-free framework and identified important coverage assumptions (called belief and outcome coverage) that enable accurate OPE of memoryless policies with polynomial sample complexities, but handling more general target policies that depend on the entire observable history remained an open problem. In this work, we prove information-theoretic hardness for model-free OPE of history-dependent policies in several settings, characterized by additional assumptions imposed on the behavior policy (memoryless vs. history-dependent) and/or the state-revealing property of the POMDP (single-step vs. multi-step revealing). We further show that some hardness can be circumvented by a natural model-based algorithm—whose analysis has surprisingly eluded the literature despite the algorithm’s simplicity—demonstrating provable separation between model-free and model-based OPE in POMDPs.

583Diffusion Attribution Score: Which Training Sample Determines Your Generation?

[openreview] [pdf]

Abstract As diffusion models advance, the scientific community is actively developing methods to curb the misuse of generative models, aiming to prevent the reproduction of copyrighted, explicitly violent, or personally sensitive information in generated images. One strategy is to identify the contribution of training samples in generative models by evaluating their influence on the generated images, a task known as data attribution. Existing data attribution approaches on diffusion models suggest representing the contribution of a specific training sample by evaluating the change in the diffusion loss when the sample is included versus excluded from the training process. However, we argue that directly using the diffusion loss cannot represent such a contribution accurately, owing to how the diffusion loss is calculated. Specifically, these approaches measure the divergence between predicted and ground truth distributions, which leads to an indirect comparison between the predicted distributions and cannot represent the variances between model behaviors. To address these issues, we aim to measure the direct comparison between predicted distributions with an attribution score that analyses the importance of training samples, which is achieved by the Diffusion Attribution Score (DAS). Underpinned by rigorous theoretical analysis, we elucidate the effectiveness of DAS. Additionally, we explore strategies to accelerate DAS calculations, facilitating its application to large-scale diffusion models. Our extensive experiments across various datasets and diffusion models demonstrate that DAS significantly surpasses previous benchmarks in terms of the linear data-modelling score, establishing new state-of-the-art performance. Code is available at https://anonymous.4open.science/r/Diffusion-Attribution-Score-411F.

584Comparing Targeting Strategies for Maximizing Social Welfare with Limited Resources

[openreview] [pdf]

Abstract Machine learning is increasingly used to select which individuals receive limited-resource interventions in domains such as human services, education, development, and more. However, it is often not apparent what the right quantity is for models to predict. In particular, policymakers rarely have access to data from a randomized controlled trial (RCT) that would enable accurate estimates of treatment effects – which individuals would benefit more from the intervention. Observational data is more likely to be available, creating a substantial risk of bias in treatment effect estimates. Practitioners instead commonly use a technique termed “risk-based targeting”, where the model is used simply to predict each individual’s status quo outcome (an easier, non-causal task). Those with higher predicted risk are offered treatment. There is currently almost no empirical evidence to inform which choices lead to the most effective machine-learning-informed targeting strategies in social domains. In this work, we use data from 5 real-world RCTs in a variety of domains to empirically assess such choices. We find that risk-based targeting is almost always inferior to targeting based on even biased estimates of treatment effects. Moreover, these results hold even when the policymaker has strong normative preferences for assisting higher-risk individuals. Our results imply that, despite the widespread use of risk prediction models in applied settings, practitioners may be better off incorporating even weak evidence about heterogeneous causal effects to inform targeting.

585Efficient and Accurate Explanation Estimation with Distribution Compression

[openreview] [pdf]

Abstract Exact computation of various machine learning explanations requires numerous model evaluations and in extreme cases becomes impractical. The computational cost of approximation increases with the ever-increasing size of data and model parameters. Many heuristics have been proposed to approximate post-hoc explanations efficiently. This paper shows that the standard i.i.d. sampling used in a broad spectrum of algorithms for explanation estimation leads to an approximation error that leaves substantial room for improvement. To this end, we introduce compress then explain (CTE), a new paradigm for more efficient and accurate explanation estimation. CTE uses distribution compression through kernel thinning to obtain a data sample that best approximates the marginal distribution. We show that CTE improves the estimation of removal-based local and global explanations with negligible computational overhead. It often achieves an on-par explanation approximation error using 2-3x fewer samples, i.e. requiring 2-3x fewer model evaluations. CTE is a simple yet powerful plug-in for any explanation method that currently relies on i.i.d. sampling.
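
A minimal sketch of the compress-then-explain flow, with a greedy kernel-herding selector standing in for proper kernel thinning (the paper uses kernel thinning itself; all names below are illustrative):

```python
import numpy as np

def greedy_compress(X, m, bandwidth=1.0):
    """Greedily pick m points whose Gaussian-kernel mean embedding best
    matches the full sample's -- a simple stand-in for kernel thinning."""
    k = lambda A, B: np.exp(-np.linalg.norm(A[:, None] - B[None, :], axis=-1) ** 2
                            / (2 * bandwidth ** 2))
    G = k(X, X)
    target = G.mean(axis=1)              # kernel mean of the full sample
    chosen = []
    for _ in range(m):
        best_i, best = None, np.inf
        for i in range(len(X)):
            if i in chosen:
                continue
            score = np.abs(target - G[:, chosen + [i]].mean(axis=1)).sum()
            if score < best:
                best_i, best = i, score
        chosen.append(best_i)
    return X[chosen]

# Usage idea: background = greedy_compress(X_train, m=128), then pass
# `background` (instead of an i.i.d. subsample) to your explainer of choice.
```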

586Adaptive Priors from Learning Trajectories for Function-Space Bayesian Neural Networks

[openreview] [pdf]

Abstract Tractable Function-space Variational Inference (T-FVI) provides a way to estimate the function-space Kullback-Leibler (KL) divergence between a random prior function and its posterior. This allows the optimization of the function-space KL divergence via Stochastic Gradient Descent (SGD) and thus simplifies the training of function-space Bayesian Neural Networks (BNNs). However, function-space BNNs on high-dimensional datasets typically require deep neural networks (DNN) with numerous parameters, and thus defining suitable function-space priors remains challenging. For instance, the Gaussian Process (GP) prior suffers from scalability issues, and DNNs do not provide a clear way to set appropriate weight parameters to achieve meaningful function-space priors. To address this issue, we propose an explicit form of function-space priors that can be easily integrated into widely-used DNN architectures, while adaptively incorporating different levels of uncertainty based on the function’s inputs. To achieve this, we consider DNNs as Bayesian last-layer models to obtain the explicit mean and variance functions of our prior. The parameters of these explicit functions are determined using the weight statistics over the learning trajectory. Our empirical experiments show improved uncertainty estimation in image classification, transfer learning, and UCI regression tasks.

587Generalization Performance Gap Analysis between Centralized and Federated Learning: How to Bridge this Gap?

[openreview] [pdf]

Abstract The rising interest in decentralized data and privacy protection has led to the emergence of Federated Learning. Many studies have compared federated training with classical centralized training and found empirically that models trained in a federated setup with equal resources perform worse on tasks. However, these studies have generally been empirical and have not explored the performance gap from a theoretical perspective. The lack of theoretical understanding prevents determining whether federated algorithms are necessarily inferior to centralized algorithms in performance, and how large this gap is under given training settings. It also hinders identifying valid ways to close this performance gap. This paper fills this theoretical gap by formulating federated training as an SGD (Stochastic Gradient Descent) optimization problem over decentralized data and defining the performance gap within the PAC-Bayes (Probably Approximately Correct Bayesian) framework. Through theoretical analysis, we derive non-vacuous bounds on this performance gap, revealing that a difference in generalization performance necessarily exists when training resources are equal for both setups and that variations in the training parameters affect the gap. Moreover, we prove that completely eliminating the performance gap is only possible by introducing new clients or adding new data to existing clients; advantages in other training resources, such as giving larger models or more communication rounds to federated scenarios, cannot close the gap. Our theoretical findings are validated by extensive experimental results from different model architectures and datasets.

588FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors

[openreview] [pdf]

Abstract Neural Radiance Fields (NeRF) face significant challenges in few-shot scenarios, particularly due to overfitting and long training times for high-fidelity rendering. While current approaches like FreeNeRF and SparseNeRF use frequency regularization or pre-trained priors, they can be limited by complex scheduling or potential biases. We introduce FrugalNeRF, a novel few-shot NeRF framework that leverages weight-sharing voxels across multiple scales to efficiently represent scene details. Our key contribution is a cross-scale geometric adaptation training scheme that selects pseudo ground truth depth based on reprojection error from both training and novel views across scales. This guides training without relying on externally learned priors, allowing FrugalNeRF to fully utilize available data. While not dependent on pre-trained priors, FrugalNeRF can optionally integrate them for enhanced quality without affecting convergence speed. Our method generalizes effectively across diverse scenes and converges more rapidly than state-of-the-art approaches. Our experiments on standard LLFF, DTU, and RealEstate-10K datasets demonstrate that FrugalNeRF outperforms existing few-shot NeRF models, including those using pre-trained priors, while significantly reducing training time, making it a practical solution for efficient and accurate 3D scene reconstruction.

589Sampling Process Brings Additional Bias for Debiased Recommendation

[openreview] [pdf]

Abstract In recommender systems, selection bias arises from the users’ selective interactions with items, which poses a widely-recognized challenge for unbiased evaluation and learning for recommendation models. Recently, doubly robust learning and its variants have been widely studied to achieve debiased learning of prediction models. However, if the users and items in the training set are not exactly the same as those in the test set, even if the imputed errors and learned propensities are accurate, all previous doubly robust based debiasing methods are biased. To tackle this problem, in this paper, we first derive the bias of doubly robust learning methods and provide alternative unbiasedness conditions when users and items are sampled from a superpopulation. Then we propose a novel superpopulation doubly robust target learning approach (SuperDR), which is unbiased when either the imputation model or propensity model is correctly specified. We further derive the generalization error bound of the proposed method under superpopulation, and show that it can be effectively controlled by the proposed target learning approach. We conduct extensive experiments on three real-world datasets, including a large-scale industrial dataset, to demonstrate the effectiveness of our method.
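
For context, the classical doubly robust error estimator that SuperDR generalizes fits in a few lines (names illustrative; the superpopulation-targeted correction is the paper's contribution and is not shown):

```python
import numpy as np

def doubly_robust_error(observed, errors, imputed_errors, propensities):
    """Classical DR estimate of average prediction error over all user-item
    pairs: imputed error everywhere, plus inverse-propensity-weighted
    residual corrections on observed pairs. `errors` only needs to be valid
    where observed == 1; all arrays are flattened over user-item pairs."""
    correction = observed * (errors - imputed_errors) / propensities
    return np.mean(imputed_errors + correction)
```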

590Gap Preserving Distillation by Building Bidirectional Mappings with A Dynamic Teacher

[openreview] [pdf]

Abstract Knowledge distillation aims to transfer knowledge from a large teacher model to a compact student counterpart, often with a significant performance gap between them. We find that a too-large performance gap can hamper the training process, which is also verified in recent studies. To address this, we propose a Gap Preserving Distillation (GPD) method that trains an additional dynamic teacher model from scratch alongside the student to bridge this gap. In this way, it becomes possible to maintain a reasonable performance gap between teacher and student during the whole distillation process. To further strengthen distillation from the dynamic teacher to the student, we develop a hard strategy by enforcing them to share parameters and encouraging parameter inheritance. Besides the hard strategy, we also build soft bidirectional mappings between them, based on an Inverse Reparameterization (IR) method and a Channel-Branch Reparameterization (CBR) strategy. We highlight that our IR is able to initialize a larger dynamic teacher with an arbitrary expansion ratio while preserving exactly the same accuracy as the given student model. In this way, it guarantees that the dynamic teacher and student start from the same point, avoiding a too-large gap in the early stage of training. As for our CBR, with parameter sharing, it directly extracts an effective student model from the well-learned dynamic teacher without any post-training, making our method highly flexible for model deployment. In the experiments, GPD significantly outperforms existing distillation methods on both CNN and transformer architectures, achieving up to 1.58% accuracy improvement. Interestingly, GPD also generalizes well to scenarios without a pretrained teacher, including training from scratch and fine-tuning, yielding a large improvement of 1.80% and 0.89% on ResNet18, respectively.

591High dimensional Bayesian Optimization via Condensing-Expansion Projection

[openreview] [pdf]

Abstract In high-dimensional settings, Bayesian optimization (BO) can be expensive and infeasible. The random embedding Bayesian optimization algorithm is commonly used to address high-dimensional BO challenges. However, this method relies on the effective subspace assumption on the optimization problem’s objective function, which limits its applicability. In this paper, we introduce Condensing-Expansion Projection Bayesian optimization (CEPBO), a novel random projection-based approach for high-dimensional BO that does not rely on the effective subspace assumption. The approach is both simple to implement and highly practical. We present two algorithms based on different random projection matrices: the Gaussian projection matrix and the hashing projection matrix. Experimental results demonstrate that both algorithms outperform existing random embedding-based algorithms in most cases, achieving superior performance on high-dimensional BO problems. The code is available at https://anonymous.4open.science/r/CEPBO-14429.

592Learning system dynamics without forgetting

[openreview] [pdf]

Abstract Observation-based trajectory prediction for systems with unknown dynamics is essential in fields such as physics and biology. Most existing approaches are limited to learning within a single system with fixed dynamics patterns. However, many real-world applications require learning across systems with evolving dynamics patterns, a challenge that has been largely overlooked. To address this, we systematically investigate the problem of Continual Dynamics Learning (CDL), examining task configurations and evaluating the applicability of existing techniques, while identifying key challenges. In response, we propose the Mode-switching Graph ODE (MS-GODE) model, which integrates the strengths of LG-ODE and sub-network learning with a mode-switching module, enabling efficient learning over varying dynamics. Moreover, we construct a novel benchmark of biological dynamic systems for CDL, Bio-CDL, featuring diverse systems with disparate dynamics and significantly enriching the research field of machine learning for dynamic systems. Our code and benchmark datasets will be publicly available.

593Expected Sliced Transport Plans

[openreview] [pdf]

Abstract The optimal transport (OT) problem has gained significant traction in modern machine learning for its ability to: (1) provide versatile metrics, such as Wasserstein distances and their variants, and (2) determine optimal couplings between probability measures. To reduce the computational complexity of OT solvers, methods like entropic regularization and sliced optimal transport have been proposed. The sliced OT framework improves efficiency by comparing one-dimensional projections (slices) of high-dimensional distributions. However, despite their computational efficiency, sliced-Wasserstein approaches lack a transportation plan between the input measures, limiting their use in scenarios requiring explicit coupling. In this paper, we address two key questions: Can a transportation plan be constructed between two probability measures using the sliced transport framework? If so, can this plan be used to define a metric between the measures? We propose a ‘lifting’ operation to extend one-dimensional optimal transport plans back to the original space of the measures. By computing the expectation of these lifted plans, we derive a new transportation plan, termed expected sliced transport (EST) plans. We further prove that using the EST plan to weight the sum of the individual Euclidean costs |x − y|^p for moving from x to y results in a valid metric between the input discrete probability measures. Finally, we demonstrate the connection between our approach and the recently proposed min-SWGG, along with illustrative numerical examples that support our theoretical findings.

594LEARN TO LEARN CONSISTENTLY

[openreview] [pdf]

Abstract In the few-shot learning problem, a model trained on a disjoint meta-train dataset is required to address novel tasks with limited novel examples. A key challenge in few-shot learning is the model’s propensity to learn biased shortcut features (e.g., background, noise, shape, color), which are sufficient to distinguish the few examples during fast adaptation but lead to poor generalization. In our work, we observed that when the model learns with higher consistency, it tends to be less influenced by shortcut features, resulting in better generalization. Based on this observation, we propose a simple yet effective meta-learning method named Meta Self-Distillation. By maximizing the consistency of the learned knowledge during the meta-train phase, the model initialized by our method shows better generalization in the meta-test phase. Extensive experiments demonstrate that our method improves the model’s generalization across various few-shot classification scenarios and enhances the model’s ability to learn consistently.

595RecFlow: An Industrial Full Flow Recommendation Dataset

[openreview] [pdf]

Abstract Industrial recommendation systems (RS) rely on the multi-stage pipeline to balance effectiveness and efficiency when delivering items from a vast corpus to users. Existing RS benchmark datasets primarily focus on the exposure space, where novel RS algorithms are trained and evaluated. However, when these algorithms transition to real-world industrial RS, they face a critical challenge: handling unexposed items—a significantly larger space than the exposed one. This discrepancy profoundly impacts their practical performance. Additionally, these algorithms often overlook the intricate interplay between multiple RS stages, resulting in suboptimal overall system performance. To address this issue, we introduce RecFlow—an industrial full-flow recommendation dataset designed to bridge the gap between offline RS benchmarks and the real online environment. Unlike existing datasets, RecFlow includes samples not only from the exposure space but also unexposed items filtered at each stage of the RS funnel. Our dataset comprises 38M interactions from 42K users across nearly 9M items, with an additional 1.9B stage samples collected from 9.3M online requests over 37 days, spanning 6 stages. Leveraging the RecFlow dataset, we conduct exploratory experiments, showcasing its potential in designing new algorithms that enhance effectiveness by incorporating stage-specific samples. Some of these algorithms have already been deployed online, consistently yielding significant gains. We propose RecFlow as the first comprehensive benchmark dataset for the RS community, supporting research on designing algorithms at any stage, study of selection bias, debiased algorithms, multi-stage consistency and optimality, multi-task recommendation, and user behavior modeling. The RecFlow dataset, along with the corresponding source code, is publicly available at https://github.com/RecFlow-ICLR/RecFlow. The dataset is licensed under CC-BY-NC-SA-4.0 International License.

596OOD-Chameleon: Is Algorithm Selection for OOD Generalization Learnable?

[openreview] [pdf]

Abstract Out-of-distribution (OOD) generalization is challenging because distribution shifts come in many forms. A multitude of learning algorithms exist and each can improve performance in specific OOD situations. We posit that much of the challenge of OOD generalization lies in choosing the right algorithm for the right dataset. However, such algorithm selection is often elusive under complex real-world shifts. In this work, we formalize the task of algorithm selection for OOD generalization and investigate whether it could be approached by learning. We propose a solution, dubbed OOD-Chameleon, that treats the task as a supervised classification over candidate algorithms. We construct a dataset of datasets to learn from, which represents diverse types, magnitudes and combinations of shifts (covariate shift, label shift, spurious correlations). We train the model to predict the relative performance of algorithms given a dataset’s characteristics. This enables a priori selection of the best learning strategy, i.e. without training various models as needed with traditional model selection. Our experiments show that the adaptive selection outperforms any individual algorithm and simple selection heuristics, on unseen datasets of controllable and realistic image data. Inspecting the model shows that it learns non-trivial data/algorithm interactions, and reveals the conditions for any one algorithm to surpass another. This opens new avenues for (1) enhancing OOD generalization with existing algorithms instead of designing new ones, and (2) gaining insights into the applicability of existing algorithms with respect to datasets’ properties.

597Breaking Free: Hacking Diffusion Models for Generating Adversarial Examples and Bypassing Safety Guardrails

[openreview] [pdf]

Abstract Deep neural networks can be exploited using natural adversarial samples, which do not impact human perception. Current approaches often rely on synthetically altering the distribution of adversarial samples compared to the training distribution. In contrast, we propose EvoSeed, a novel evolutionary strategy-based algorithmic framework that uses auxiliary Conditional Diffusion and Classifier models to generate photo-realistic natural adversarial samples. We employ CMA-ES to optimize the initial seed vector search, which, when processed by the Conditional Diffusion Model, results in the natural adversarial sample misclassified by the Classifier Model. Experiments show that generated adversarial images are of high image quality, raising concerns about generating harmful content bypassing safety classifiers. We also show that beyond generating adversarial images, EvoSeed can also be used as a red-teaming tool to understand classification systems’ misclassification. Our research opens new avenues for understanding the limitations of current safety mechanisms and the risk of plausible attacks against classifier systems using image generation.
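
A minimal sketch of the seed-space search loop using the `cma` package, with stand-in `generate`/`classify` callables (EvoSeed's actual objective and constraints are richer):

```python
import numpy as np
import cma  # pip install cma

def evolve_adversarial_seed(generate, classify, true_label, seed_dim, iters=50):
    """Search the diffusion model's initial-seed space with CMA-ES so the
    generated image is misclassified. `generate(seed) -> image` and
    `classify(image) -> class probabilities` are illustrative stand-ins."""
    es = cma.CMAEvolutionStrategy(np.zeros(seed_dim), 0.5)
    for _ in range(iters):
        seeds = es.ask()
        # fitness = probability of the true label: lower means "more adversarial"
        fitness = [float(classify(generate(np.asarray(s)))[true_label]) for s in seeds]
        es.tell(seeds, fitness)
    return np.asarray(es.result.xbest)
```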

598Imagine to Ensure Safety in Hierarchical Reinforcement Learning

[openreview] [pdf]

Abstract This work investigates the safe exploration problem, where an agent must maximize performance while satisfying safety constraints. To address this problem, we propose a method that includes a learnable world model and two policies, a high-level policy and a low-level policy, that ensure safety at both levels. The high-level policy generates safe subgoals for the low-level policy, which progressively guide the agent towards the final goal. Through trajectory imagination, the low-level policy learns to safely reach these subgoals. The proposed method was evaluated on the standard benchmark, SafetyGym, and demonstrated superior performance while maintaining a comparable number of safety violations relative to state-of-the-art approaches. In addition, we investigated an alternative implementation of safety in hierarchical reinforcement learning (HRL) algorithms using Lagrange multipliers, and demonstrated in the custom long-horizon environment SafeAntMaze that our approach achieves comparable performance while satisfying safety constraints more effectively, whereas the flat safe policy fails to accomplish this task.

599Beyond-Expert Performance with Limited Demonstrations: Efficient Imitation Learning with Double Exploration

[openreview] [pdf]

Abstract We study imitation learning, where the goal is to learn a policy that mimics the expert’s behavior. In practice, it is often challenging to learn the expert policy accurately from a limited number of demonstrations due to the complexity of the state space. Moreover, it is essential to explore the environment and collect data to achieve beyond-expert performance. To overcome these challenges, we propose a novel imitation learning algorithm, namely Imitation Learning with Double Exploration (ILDE), which implements exploration in two aspects: (1) optimistic policy optimization via an exploration bonus that rewards state-action pairs with high uncertainty, to potentially improve the convergence to the expert policy; and (2) curiosity-driven exploration of the states that deviate from the demonstration trajectories, to potentially yield beyond-expert performance. Empirically, we demonstrate that ILDE outperforms state-of-the-art imitation learning algorithms in terms of sample efficiency and achieves beyond-expert performance on Atari and MuJoCo tasks with fewer demonstrations than those used in previous work. We also provide a theoretical justification of ILDE as an uncertainty-regularized policy optimization method with optimistic exploration, leading to a regret that grows sublinearly in the number of episodes.

600Inertial Confinement Fusion Forecasting via Large Language Models

[openreview] [pdf]

Abstract Controlled fusion energy is deemed pivotal for the advancement of human civilization. In this study, we introduce LPI-LLM, a novel integration of Large Language Models (LLMs) with classical reservoir computing paradigms tailored to address a critical challenge, Laser-Plasma Instabilities (LPI), in Inertial Confinement Fusion (ICF). Our approach offers several key contributions: Firstly, we propose the LLM-anchored Reservoir, augmented with a Fusion-specific Prompt, enabling accurate forecasting of LPI-generated hot-electron dynamics during implosion. Secondly, we develop Signal-Digesting Channels to temporally and spatially describe the driver laser intensity across time, capturing the unique characteristics of ICF inputs. Lastly, we design the Confidence Scanner to quantify the confidence level in forecasting, providing valuable insights for domain experts to design the ICF process. Extensive experiments demonstrate the superior performance of our method, achieving 1.90 CAE, 0.14 top-1 MAE, and 0.11 top-5 MAE in predicting Hard X-ray (HXR) energies emitted by the hot electrons in ICF implosions, which presents state-of-the-art comparisons against concurrent best systems. Additionally, we present LPI4AI, the first LPI benchmark based on physical experiments, aimed at fostering novel ideas in LPI research and enhancing the utility of LLMs in scientific exploration. Overall, our work strives to forge an innovative synergy between AI and ICF for advancing fusion energy.

601Can a Bayesian oracle prevent harm from an agent?

[openreview] [pdf]

Abstract Is there a way to design powerful AI systems based on machine learning methods that would satisfy probabilistic safety guarantees? With the long-term goal of obtaining a probabilistic guarantee that would apply in every context, we consider estimating a context-dependent bound on the probability of violating a given safety specification. Such a risk evaluation would need to be performed at run-time to provide a guardrail against dangerous actions of an AI. Noting that different plausible hypotheses about the world could produce very different outcomes, and because we do not know which one is right, we derive bounds on the safety violation probability predicted under the true but unknown hypothesis. Such bounds could be used to reject potentially dangerous actions. Our main results involve searching for cautious but plausible hypotheses, obtained by a maximization that involves Bayesian posteriors over hypotheses. We consider two forms of this result, in the i.i.d. case and in the non-i.i.d. case, and conclude with open problems towards turning such theoretical results into practical AI guardrails.

602Pan for gold

[openreview] [pdf]

Abstract Training a deep model is fundamentally about reducing loss, and we often believe that a “good model” is one trained with a “good loss.” This paper investigates that belief. We show that even when learning with unstructured, randomized labels, models can still discover generalized features. We propose that generalization in deep learning is not about learning the structure of data through a well-structured loss, but rather a process akin to “pan for gold,” where gradient descent shakes through the function space, naturally stabilizing useful features. To support this, we present quantitative and qualitative experimental evidence, and introduce the Panning through Unstructured Label (PUL) algorithm. We demonstrate its effectiveness across various fields, showing improvements in unsupervised domain adaptation, state-of-the-art performance in object discovery, and its ability to mitigate massive attention issues. Finally, we offer a new interpretation of existing deep learning assumptions, challenging conventional beliefs in the field.

603Synthetic Theorem Generation in Lean

[openreview] [pdf]

Abstract The application of large language models (LLMs) to theorem proving presents a promising avenue for advancing formal mathematics. Interactive theorem provers, such as Lean, offer a rigorous framework within which these models can assist in or automate proof discovery, grounding their reasoning capabilities in a sound, verifiable formal system. However, the potential of LLMs in this domain is constrained by the limited availability of formal proof corpora for training. To address this limitation, we introduce a synthetic theorem generator capable of producing novel Lean theorems and their corresponding proofs. Our approach employs forward reasoning to synthesize new propositions from premises drawn from existing Lean libraries. We explore candidate reasoning steps using a search strategy that optimizes for diversity of output, apply them in a linear fashion that avoids irrelevant proof steps, and assess their effect by meta-programmatically executing corresponding Lean tactics. These methods enable the generation of an arbitrary number of new theorems and proofs across various mathematical domains, using common Lean proof tactics while ensuring the correctness of generated theorems by construction. We demonstrate the efficacy of the generated theorems and training data by fine-tuning models on synthetic theorems and evaluating them on the miniF2F-test benchmark. Our results show improvements in theorem-proving capabilities, with accuracy increasing from 37.3% to 38.5% for the Falcon2-11B model trained solely on Mathlib, and from 38.1% to 39.3% for the same model trained on a mix of rich datasets. These improvements highlight the value of our diverse synthetic data in augmenting limited existing corpora of formal proofs, providing complementary information that enhances LLMs’ performance on theorem-proving tasks even when combined with other datasets.
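
To give a flavor of correct-by-construction synthesis, here is a toy Lean 4 theorem of the kind a forward-reasoning generator might emit by chaining two core-library premises; the theorem and its name are our own illustration, not an output of the paper's system.

```lean
-- A "synthesized" proposition built by chaining library premises
-- (Nat.add_assoc, Nat.add_comm); the proof is correct by construction.
theorem synth_add_rot (a b c : Nat) : a + (b + c) = c + (a + b) := by
  rw [← Nat.add_assoc a b c, Nat.add_comm (a + b) c]
```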

604Do LLM Agents Have Regret? A Case Study in Online Learning and Games

[openreview] [pdf]

Abstract Large language models (LLMs) have been increasingly employed for (interactive) decision-making, via the development of LLM-based autonomous agents. Despite their emerging successes, the performance of LLM agents in decision-making has not been fully investigated through quantitative metrics, especially in the multi-agent setting when they interact with each other, a typical scenario in real-world LLM-agent applications. To better understand the limits of LLM agents in these interactive environments, we propose to study their interactions in benchmark decision-making settings in online learning and game theory, through the performance metric of regret. We first empirically study the no-regret behaviors of LLMs in canonical non-stochastic online learning problems, as well as the emergence of equilibria when LLM agents interact through playing repeated games. We then provide some theoretical insights into the no-regret behaviors of LLM agents, under certain assumptions on the supervised pre-training and the rationality model of human decision-makers who generate the data. Notably, we also identify (simple) cases where advanced LLMs such as GPT-4 fail to be no-regret. To further promote the no-regret behaviors, we propose a novel unsupervised training loss of regret-loss, which, in contrast to the supervised pre-training loss, does not require the labels of (optimal) actions. Finally, we establish the statistical guarantee of generalization bound for regret-loss minimization, and more importantly, the optimization guarantee that minimizing such a loss may automatically lead to known no-regret learning algorithms, when single-layer self-attention models are used. Our further experiments demonstrate the effectiveness of our regret-loss, especially in addressing the above “regrettable” cases.
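
For reference, the regret notion at the heart of the study is the standard external regret of online learning (the paper's game-theoretic variants build on the same quantity):

```latex
% External regret of an agent playing a_1, ..., a_T against losses \ell_t:
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} \ell_t(a_t) \;-\; \min_{a \in \mathcal{A}} \sum_{t=1}^{T} \ell_t(a)
% "No-regret" means sublinear growth: \mathrm{Regret}_T / T \to 0 as T \to \infty.
```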

605MissDiff: Training Diffusion Models on Tabular Data with Missing Values

[openreview] [pdf]

Abstract The diffusion model has shown remarkable performance in modeling data distributions and synthesizing data. However, the vanilla diffusion model requires complete or fully observed training data. Incomplete data is a common issue in various real-world applications, including healthcare and finance, particularly when dealing with tabular datasets. This work considers learning from data with missing values for missing value imputations and generating synthetic complete data in a unified framework. With minimal assumptions on the missing mechanisms, our method models the score of complete data distribution by denoising score matching on data with missing values. We prove that the proposed method can recover the score of the complete data distribution, and the proposed training objective serves as an upper bound for the negative likelihood of observed data. Extensive experiments on imputation tasks together with generation tasks demonstrate that our proposed framework outperforms existing state-of-the-art approaches on multiple tabular datasets.
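
A minimal sketch of what a masked denoising-score-matching objective of this flavor can look like, assuming a score network that receives the observation mask; the noise scale and the masking interface are illustrative choices, not the paper's exact recipe:

```python
import torch

def masked_dsm_loss(score_net, x, obs_mask, sigma=0.1):
    """Denoising score matching evaluated only on observed entries.
    `score_net`, `sigma`, and the masking scheme are assumptions."""
    noise = torch.randn_like(x)
    x_tilde = x + sigma * noise
    target = -noise / sigma  # score of the Gaussian perturbation kernel
    pred = score_net(x_tilde * obs_mask, obs_mask)
    # Missing entries contribute nothing to the loss.
    return (((pred - target) ** 2) * obs_mask).sum() / obs_mask.sum()
```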

606Unified Perspectives on Signal-to-Noise Diffusion Models

[openreview] [pdf]

Abstract Diffusion models (DM) have become essential components of generative modeling, demonstrating exceptional performance in domains like image synthesis, audio generation, and complex data interpolation. Signal-to-Noise diffusion models represent a broad family encompassing many state-of-the-art models. Although several efforts have been made to explore Signal-to-Noise (S2N) diffusion models from different angles, a comprehensive study that connects these viewpoints and introduces new insights is still needed. In this work, we provide an in-depth perspective on noise schedulers, analyzing their role through the lens of the signal-to-noise ratio (SNR) and its relationship to information theory. Based on this framework, we introduce a generalized backward equation to improve the efficiency of the inference process.
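
For reference, the standard signal-to-noise parameterization this model family shares can be written as follows; the notation matches common usage rather than the paper's exact symbols:

```latex
% Forward process scales the signal by \alpha_t and adds noise of scale \sigma_t;
% the SNR is the squared ratio of the two and decreases monotonically in t.
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \alpha_t x_0,\ \sigma_t^2 I\right),
\qquad
\mathrm{SNR}(t) = \frac{\alpha_t^2}{\sigma_t^2}.
```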

607Causally Motivated Diffusion Sampling Frameworks for Harnessing Contextual Bias

[openreview] [pdf]

Abstract Diffusion models have shown remarkable performance in text-guided image generation when trained on large-scale datasets, usually collected from the Internet. These large-scale datasets have contextual biases (e.g., co-occurrence of objects) which will naturally cascade into the diffusion model. For example, given a text prompt of "a photo of the living room", diffusion models frequently generate a couch, a rug, and a lamp together while rarely generating objects that do not commonly occur in a living room. Intuitively, contextual bias can be helpful because it naturally draws the scene even without detailed information (i.e., visual autofill). On the other hand, contextual bias can limit the diversity of generated images (e.g., diverse object combinations) to focus on common image compositions. To have the best of both worlds, we argue that contextual bias needs to be strengthened or weakened depending on the situation. Previous causally-motivated studies have tried to deal with such issues by analyzing confounders (i.e., contextual bias) and augmenting training data or designing their models to directly learn the interventional distribution. However, due to the large-scale nature of these models, obtaining and analyzing the data or training the huge model from scratch is beyond reach in practice. To tackle this problem, we propose two novel frameworks for strengthening or weakening the contextual bias of pretrained diffusion models without training any parameters or accessing training data. Briefly, we first propose causal graphs to explicitly model contextual bias in the generation process. We then sample the hidden confounder due to contextual bias by sampling from a chain of pretrained large-scale models. Finally, we use samples from the confounder to strengthen or weaken the contextual bias based on methods from causal inference. Experimental results show that our proposed methods are effective in generating more realistic and diverse images than the regular sampling method.

608Critique-out-Loud Reward Models

[openreview] [pdf]

Abstract Traditionally, reward models used for reinforcement learning from human feedback (RLHF) are trained to directly predict preference scores without leveraging the generation capabilities of the underlying large language model (LLM). This limits the capabilities of reward models as they must reason implicitly about the quality of a response, i.e., preference modeling must be performed in a single forward pass through the model. To enable reward models to reason explicitly about the quality of a response, we introduce Critique-out-Loud (CLoud) reward models. CLoud reward models operate by first generating a natural language critique of the assistant’s response that is then used to predict a scalar reward for the quality of the response. We demonstrate the success of CLoud reward models for both Llama-3-8B and 70B base models: compared to classic reward models, CLoud reward models improve pairwise preference classification accuracy on RewardBench by 4.65 and 5.84 percentage points for the 8B and 70B base models respectively. Furthermore, CLoud reward models lead to a Pareto improvement for win rate on ArenaHard when used as the scoring model for Best-of-N. Finally, we explore how to exploit the dynamic inference compute capabilities of CLoud reward models by performing self-consistency decoding for reward prediction.

609Federated Learning in Streaming Subspace

[openreview] [pdf]

Abstract Federated learning (FL) has received widespread attention due to its distributed training and privacy protection. However, existing federated learning methods encounter significant challenges, such as increased communication costs and degraded model performance, when processing non-independently and identically distributed (non-IID) data. This paper jointly alleviates these problems by analyzing and exploiting the low-rank properties of global model trajectories. Primarily, we introduce a streaming subspace update strategy and then propose a general federated learning framework, Federated Learning in Streaming Subspace (FLSS). In FLSS, local model updates are restricted to the global streaming subspace, resulting in low-dimensional trajectories. The server then aggregates these trajectories to update the global model. Comprehensive experiments verify the effectiveness of our framework. On CIFAR-100, the FLSS-equipped FL method outperforms the baseline by 2.14% and reduces the communication cost by 80%. FLSS utilizes the early training information of the global model to simultaneously improve the performance and communication efficiency of federated learning.

610Taming Transformer Without Using Learning Rate Warmup

[openreview] [pdf]

Abstract Scaling Transformer to a large scale without using technical tricks such as learning rate warmup and an obviously lower learning rate is an extremely challenging task, and is increasingly gaining more attention. In this paper, we provide a theoretical analysis of training Transformers and reveal a key problem behind the model crash phenomenon in training, i.e., the spectral energy concentration of $W_q^{\top} W_k$, which is the reason for a malignant entropy collapse. To remedy this problem, motivated by Weyl's Inequality, we present a novel optimization strategy: making weight updates in successive steps smooth. That is, if the ratio $\frac{\sigma_{1}(\nabla W_t)}{\sigma_{1}(W_{t-1})}$ is larger than a threshold, where $\nabla W_t$ is the update quantity in step $t$, we automatically bound the learning rate to a weighted multiple of $\frac{\sigma_{1}(W_{t-1})}{\sigma_{1}(\nabla W_t)}$. Our optimization strategy prevents the rapid concentration of spectral energy in only a few directions, and thus avoids the malignant entropy collapse that would trigger a model crash. We conduct extensive experiments using ViT, Swin-Transformer and GPT, showing that our optimization strategy can effectively and stably train these (Transformer) models without using learning rate warmup.
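
A minimal sketch of the described safeguard in PyTorch, treating the threshold `tau` and the weight `gamma` as assumed hyperparameters; the paper's exact bounding rule may differ:

```python
import torch

def bounded_lr(base_lr, W_prev, dW, tau=0.1, gamma=1.0):
    """Shrink the step when the top singular value of the update grows
    too fast relative to that of the current weights."""
    s_update = torch.linalg.matrix_norm(dW, ord=2)      # sigma_1 of the update
    s_weight = torch.linalg.matrix_norm(W_prev, ord=2)  # sigma_1 of the weights
    ratio = s_update / (s_weight + 1e-12)
    if ratio > tau:
        # Bound the learning rate by a weighted multiple of the inverse ratio.
        return base_lr * gamma * (s_weight / s_update).item()
    return base_lr
```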

611Reward Learning From Preference With Ties

[openreview] [pdf]

Abstract Reward learning plays a pivotal role in Reinforcement Learning from Human Feedback (RLHF), ensuring the alignment of language models. The Bradley-Terry (BT) model stands as the prevalent choice for capturing human preferences from datasets containing pairs of chosen and rejected responses. In preference modeling, the focus is not on absolute values but rather on the reward difference between chosen and rejected responses, referred to as preference strength. Thus, precise evaluation of preference strength holds paramount importance in preference modeling. However, an easily overlooked factor significantly affecting preference strength measurement is that human attitudes towards two responses may not solely indicate a preference for one over the other and ties are also a common occurrence. To address this, we propose the adoption of the generalized Bradley-Terry model -- the Bradley-Terry model with ties (BTT) -- to accommodate tied preferences, thus leveraging additional information. We prove that even with the access to the true distributions of prompt and response, disregarding ties can lead to a notable bias in preference strength measurement. Comprehensive experiments further validate the advantages of incorporating ties in preference modeling. Notably, fine-tuning with BTT significantly outperforms fine-tuning with BT on synthetic preference datasets with ties, labeled by state-of-the-art open-source LLMs.
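
One standard Bradley-Terry-with-ties formulation is the Rao-Kupper model, sketched below with rewards $r_1, r_2$ and a tie parameter $\theta \ge 1$ (where $\theta = 1$ recovers plain BT); the paper's BTT variant may differ in details:

```latex
P(y_1 \succ y_2) = \frac{e^{r_1}}{e^{r_1} + \theta e^{r_2}}, \qquad
P(y_1 \sim y_2) = \frac{(\theta^2 - 1)\, e^{r_1 + r_2}}
                       {\left(e^{r_1} + \theta e^{r_2}\right)\left(e^{r_2} + \theta e^{r_1}\right)}.
```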

612Enhancing Group Fairness in Federated Learning through Personalization

[openreview] [pdf]

Abstract Personalized Federated Learning (FL) algorithms collaboratively train customized models for each client, enhancing the accuracy of the learned models on the client’s local data (e.g., by clustering similar clients, by fine-tuning models locally, or by imposing regularization terms). In this paper, we investigate the impact of such personalization techniques on the group fairness of the learned models, and show that personalization can also lead to improved (local) fairness as an unintended benefit. We begin by illustrating these benefits of personalization through numerical experiments comparing several classes of personalized FL algorithms against a baseline FedAvg algorithm, elaborating on the reasons behind improved fairness using personalized FL, and then providing analytical support. Motivated by these, we then show how to build on this (unintended) fairness benefit, by further integrating a fairness metric into the cluster-selection procedure of clustering-based personalized FL algorithms, and improve the fairness-accuracy trade-off attainable through them. Specifically, we propose two new fairness-aware federated clustering algorithms, Fair-FCA and Fair-FL+HC, extending the existing IFCA and FL+HC algorithms, and demonstrate their ability to strike a (tuneable) balance between accuracy and fairness at the client level.

613Towards Understanding Text Hallucination of Diffusion Models via Local Generation Bias

[openreview] [pdf]

Abstract Score-based diffusion models have achieved incredible performance in generating realistic images, audio, and video data. While these models produce high-quality samples with impressive details, they often introduce unrealistic artifacts, such as distorted fingers or hallucinated texts with no meaning. This paper focuses on textual hallucinations, where diffusion models correctly generate individual symbols but assemble them in a nonsensical manner. Through experimental probing, we consistently observe that this phenomenon is attributable to the network’s local generation bias. Denoising networks tend to produce outputs that rely heavily on highly correlated local regions, particularly when different dimensions of the data distribution are nearly pairwise independent. This behavior leads to a generation process that decomposes the global distribution into separate, independent distributions for each symbol, ultimately failing to capture the global structure, including underlying grammar. Intriguingly, this bias persists across various denoising network architectures, including MLPs and transformers, which have the capacity to model global dependencies. These findings also provide insights into understanding other types of hallucinations, extending beyond text, as a result of implicit biases in the denoising models. Additionally, we theoretically analyze the training dynamics for a specific case involving a two-layer MLP learning parity points on a hypercube, offering an explanation of its underlying mechanism.

614Accelerated Diffusion using Closed-form Discriminator Guidance

[openreview] [pdf]

Abstract Diffusion models are a state-of-the-art generative modeling framework that transform noise to images via Langevin sampling, guided by the score, which is the gradient of the logarithm of the data distribution. Recent works have shown empirically that the generation quality can be improved when guided by classifier network, which is typically the discriminator trained in a generative adversarial network (GAN) setting. In this paper, we propose a theoretical framework to analyze the effect of the GAN discriminator on Langevin-based sampling, and show that in IPM GANs, the optimal generator matches {\it score-like} functions, involving the flow-field of the kernel associated with a chosen IPM constraint space. Further, we show that IPM-GAN optimization can be seen as one of smoothed score-matching, where the scores of the data and the generator distributions are convolved with the kernel associated with the constraint. The proposed approach serves to unify score-based training and optimization of IPM-GANs. Based on these insights, we demonstrate that closed-form discriminator guidance, using a kernel-based implementation, results in improvements (in terms of CLIP-FID and KID metrics) when applied atop baseline diffusion models. We demonstrate these results by applying closed-form discriminator guidance to denoising diffusion implicit model (DDIM) and latent diffusion model (LDM) settings on the FFHQ and CelebA-HQ datasets. We also demonstrate improvements to accelerated time-step-shifted diffusion, when coupled with a wavelet-based noise estimator for latent-space image generation.

615FreqPrior: Improving Diffusion Models with Frequency Filtering Gaussian Noise as Prior

[openreview] [pdf]

Abstract Text-driven video generation has advanced significantly due to developments in diffusion models. Beyond the training and sampling phases, recent studies have investigated noise priors of diffusion models, as improved noise priors yield better generation results. One recent approach employs Fourier transform to manipulate noise, marking the initial exploration of frequency operations in this context. However, it often generates videos that lack motion dynamics and imaging details. In this work, we provide a comprehensive theoretical analysis of the variance decay issue present in existing methods, contributing to the loss of details and motion dynamics. Recognizing the critical impact of noise distribution on generation quality, we introduce FreqPrior, a novel noise initialization strategy that refines noise in the frequency domain. Our method features a novel filtering technique designed to address different frequency signals while maintaining the noise prior distribution that closely approximates a standard Gaussian distribution. Additionally, we propose a partial sampling process by perturbing the latent at an intermediate timestep during finding the noise prior, significantly reducing inference time without compromising quality. Extensive experiments on VBench demonstrate that our method achieves the highest scores in both quality and semantic assessments, resulting in the best overall total score. These results highlight the superiority of our proposed noise prior.

616Going Beyond Static: Understanding Shifts with Time-Series Attribution

[openreview] [pdf]

Abstract Distribution shifts in time-series data are complex due to temporal dependencies, multivariable interactions, and trend changes. However, robust methods often rely on structural assumptions that lack thorough empirical validation, limiting their practical applicability. In order to support an empirically grounded inductive approach to research, we introduce our Time-Series Shift Attribution (TSSA) framework, which analyzes application-specific patterns of distribution shifts. Our framework attributes performance degradation from various types of shifts to each temporal data property in a detailed manner, supported by theoretical analysis of unbiasedness and asymptotic properties. Empirical studies in real-world healthcare applications highlight how the TSSA framework enhances the understanding of time-series shifts, facilitating reliable model deployment and driving targeted improvements from both algorithmic and data-centric perspectives.

617SelKD: Selective Knowledge Distillation via Optimal Transport Perspective

[openreview] [pdf]

Abstract Knowledge Distillation (KD) has been a popular paradigm for training a (smaller) student model from its teacher model. However, little research has been done on the practical scenario where only a subset of the teacher’s knowledge needs to be distilled, which we term selective KD (SelKD). This demand is especially pronounced in the era of foundation models, where the teacher model can be significantly larger than the student model. To address this issue, we propose to rethink the knowledge distillation problem from the perspective of Inverse Optimal Transport (IOT). Previous Bayesian frameworks mapped each sample to the probabilities of corresponding labels in an end-to-end manner, which fixed the number of classification categories and hindered effective local knowledge transfer. In contrast, IOT calculates from the standpoint of transportation or matching, allowing for the flexible selection of samples and their quantities for matching. Traditional logit-based KD can be viewed as a special case within the IOT framework. Building on this IOT foundation, we formalize this setting in the context of classification, where only selected categories from the teacher’s category space are required to be recognized by the student under closed-set recognition, which we call closed-set SelKD, enhancing the student’s performance on specific subtasks. Furthermore, we extend the closed-set SelKD, introducing an open-set version of SelKD, where the student model is required to provide a "not selected" response for categories outside its assigned task. Experimental results on standard benchmarks demonstrate the superiority of our approach.

618Subsampled Ensemble Can Improve Generalization Tail Exponentially

[openreview] [pdf]

Abstract Ensemble learning is a popular technique to improve the accuracy of machine learning models. It hinges on the rationale that aggregating multiple weak models can lead to better models with lower variance and hence higher stability, especially for discontinuous base learners. In this paper, we provide a new perspective on ensembling. By selecting the best model trained on subsamples via majority voting, we can attain exponentially decaying tails for the excess risk, even if the base learner suffers from slow (i.e., polynomial) decay rates. This tail enhancement power of ensembling is agnostic to the underlying base learner and is stronger than variance reduction in the sense of exhibiting rate improvement. We demonstrate how our ensemble methods can substantially improve out-of-sample performances in a range of examples involving heavy-tailed data or intrinsically slow rates.
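
A minimal sketch of prediction by majority vote over base learners trained on independent subsamples, the style of aggregation analyzed here; training of the base models (each exposing a `.predict` method) is assumed to happen elsewhere, and the paper's exact selection rule may differ:

```python
import numpy as np

def subsample_vote_predict(models, X):
    """Majority vote across base learners trained on random subsamples."""
    votes = np.stack([m.predict(X) for m in models])  # (n_models, n_points)
    # Per test point, return the most common predicted class.
    return np.apply_along_axis(
        lambda v: np.bincount(v).argmax(), axis=0, arr=votes.astype(int)
    )
```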

619Glauber Generative Model: Discrete Diffusion Models via Binary Classification

[openreview] [pdf]

Abstract We introduce the Glauber Generative Model (GGM), a new class of discrete diffusion models, to obtain new samples from a distribution given samples from a discrete space. GGM deploys a discrete Markov chain called the heat bath dynamics (or the Glauber dynamics) to denoise a sequence of noisy tokens to a sample from a joint distribution of discrete tokens. Our novel conceptual framework provides an exact reduction of the task of learning the denoising Markov chain to solving a class of binary classification tasks. More specifically, the model learns to classify a given token in a noisy sequence as signal or noise. In contrast, prior works on discrete diffusion models either solve regression problems to learn importance ratios, or minimize loss functions given by variational approximations. We apply GGM to language modeling and image generation, where images are discretized using image tokenizers like VQGANs. We show that it outperforms existing discrete diffusion models in language generation, and demonstrates strong performance for image generation without using dataset-specific image tokenizers. We also show that our model is capable of performing well in zero-shot control settings like text and image infilling.

620Training on the Test Task Confounds Evaluation and Emergence

[openreview] [pdf]

Abstract We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices like training on the test data, leakage, or data contamination, training on the test task is not a malpractice. Rather, the term describes a growing set of techniques to include task-relevant data in the pretraining stage of a language model. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust for the effect of training on the test task on benchmark evaluations. Put simply, we fine-tune each model under comparison on the same task-relevant data before evaluation. Lastly, we show that instances of emergent behavior disappear gradually as models train on the test task. Our work promotes a new perspective on the evaluation of large language models with broad implications for benchmarking and the study of emergent capabilities.

621Denoising Task Difficulty-based Curriculum for Training Diffusion Models

[openreview] [pdf]

Abstract Diffusion-based generative models have emerged as powerful tools in the realm of generative modeling. Despite extensive research on denoising across various timesteps and noise levels, a conflict persists regarding the relative difficulties of the denoising tasks. While various studies argue that lower timesteps present more challenging tasks, others contend that higher timesteps are more difficult. To address this conflict, our study undertakes a comprehensive examination of task difficulties, focusing on convergence behavior and changes in relative entropy between consecutive probability distributions across timesteps. Our observational study reveals that denoising at earlier timesteps poses challenges characterized by slower convergence and higher relative entropy, indicating increased task difficulty at these lower timesteps. Building on these observations, we introduce an easy-to-hard learning scheme, drawing from curriculum learning, to enhance the training process of diffusion models. By organizing timesteps or noise levels into clusters and training models with ascending orders of difficulty, we facilitate an order-aware training regime, progressing from easier to harder denoising tasks, thereby deviating from the conventional approach of training diffusion models simultaneously across all timesteps. Our approach leads to improved performance and faster convergence by leveraging benefits of curriculum learning, while maintaining orthogonality with existing improvements in diffusion training techniques. We validate these advantages through comprehensive experiments in image generation tasks, including unconditional, class-conditional, and text-to-image generation.
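
A minimal sketch of such an easy-to-hard timestep schedule, taking the abstract's finding that lower timesteps are the harder denoising tasks; the linear unlocking rule and cluster count are illustrative choices:

```python
import numpy as np

def curriculum_timestep_sampler(step, total_steps, T=1000, n_clusters=5):
    """Sample a diffusion timestep: high-timestep (easy) clusters are
    unlocked first, low-timestep (hard) clusters last."""
    bounds = np.linspace(0, T, n_clusters + 1, dtype=int)  # cluster edges
    # Fraction of training elapsed decides how many clusters are unlocked,
    # starting from the easiest (highest-timestep) cluster.
    unlocked = 1 + int((step / total_steps) * (n_clusters - 1))
    lo = bounds[n_clusters - unlocked]  # lowest currently unlocked timestep
    return np.random.randint(lo, T)

# Early in training only t in [800, 1000) is sampled; later the range widens.
print(curriculum_timestep_sampler(step=0, total_steps=10000))
```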

622DICE: Data Influence Cascade in Decentralized Learning

[openreview] [pdf]

Abstract Decentralized learning offers a promising approach to crowdsource computational workloads across geographically distributed compute interconnected through peer-to-peer networks, accommodating the exponentially increasing compute demands in the era of large models. However, the absence of proper incentives in locally connected decentralized networks poses significant risks of free riding and malicious behaviors. Data influence, which ensures fair attribution of data source contributions, holds great potential for establishing effective incentive mechanisms. Despite the importance, little effort has been made to analyze data influence in decentralized scenarios, due to non-trivial challenges arising from the distributed nature and the localized connections inherent in decentralized networks. To overcome this fundamental incentive problem, we propose DICE, the first comprehensive framework for analyzing Data Influence CascadEs in decentralized environments. Our framework characterizes how data influence cascades across the communication network and highlights the interplay between original data and network structure in shaping data influence in decentralized learning. We anticipate that DICE can open new avenues for incentive mechanism design and enable impactful applications of influence in decentralized learning, including anomaly detection, collaborator selection and machine unlearning.

623Breaking the Detection-Generalization Paradox on Out-Of-Distribution Data

[openreview] [pdf]

Abstract This work studies the trade-off between out-of-distribution (OOD) detection and generalization. We identify the Detection-Generalization Paradox in OOD data, where optimizing one objective can degrade the other. We investigate this paradox by analyzing the behaviors of models trained under different paradigms, focusing on representation, logits, and loss across in-distribution, covariate-shift, and semantic-shift data. Based on our findings, we propose Distribution-Robust Sharpness-Aware Minimization (DR-SAM), an optimization framework that balances OOD detection and generalization. Extensive experiments demonstrate the method’s effectiveness, offering a clear, empirically validated approach for improving detection and generalization ability across different benchmarks.

624Combining Analytical Smoothing with Surrogate Losses for Improved Decision-Focused Learning

[openreview] [pdf]

Abstract Many combinatorial optimization problems in routing, scheduling, and assignment involve parameters such as price or travel time that must be predicted from data; so-called predict-then-optimize (PtO) problems. Decision-focused learning (DFL) is a family of successful end-to-end techniques for PtO that trains machine learning models to minimize the error of the downstream optimization problems. For each instance, this requires computing the derivative of the optimization problem’s solution with respect to the predicted input parameters. Previous works in DFL employ two main approaches when the parameters appear linearly in the objective: (a) using a differentiable surrogate loss instead of regret; or (b) turning the combinatorial optimization problem into a differentiable mapping by smoothing the optimization to a quadratic program or other smooth convex optimization problem and minimizing the regret of that. We argue that while smoothing makes the optimization differentiable, for a large part, the derivative remains approximately zero almost everywhere, with highly non-zero values near the transition points. To address this plateau effect, we propose minimizing a surrogate loss even after smoothing. We experimentally demonstrate the advantage of minimizing surrogate losses instead of the regret after smoothing across a series of problems. Furthermore, we show that by minimizing a surrogate loss, a recently developed fast, fully neural optimization layer matches state-of-the-art performance while dramatically reducing training time up to five-fold. Thus, our paper opens new avenues for efficient and scalable DFL techniques.

625Outcome-based Semifactual Explanation For Reinforcement Learning

[openreview] [pdf]

Abstract Counterfactual explanations in reinforcement learning (RL) aim to answer what-if questions by showing sparse and minimal changes to states, which results in the probability mass moving from one action to another. Although these explanations are effective in classification tasks that look for the presence of concepts, RL brings new challenges that current counterfactual methods for RL still need to solve. These challenges include defining similarity in RL, out-of-distribution states, and lack of discriminative power. Given a state of interest called the query state, we solve these problems by asking how long the agent can execute the query state action without incurring a negative outcome regarding the expected return. We coin this outcome-based semifactual (OSF) explanation and find the OSF state by simulating trajectories from the query state. The last state in a subtrajectory where we can take the same action as in the query state without incurring a negative outcome is the OSF state. This state is discriminative, plausible, and similar to the query state. It abstracts away unimportant action switching with little explanatory value and shows the boundary between positive and negative outcomes. Qualitatively, we show that our method explains when it is necessary to switch actions. As a result, it is easier to understand the agent’s behavior. Quantitatively, we demonstrate that our method can increase policy performance and, at the same time, reduce how often the agent switches its action across six environments. The code and trained models are available at https://anonymous.4open.science/r/osf-explanation-for-rl-E312/.

626Efficient Online Reinforcement Learning Fine-Tuning Should Not Retain Offline Data

[openreview] [pdf]

Abstract The modern paradigm in machine learning involves pre-training models on diverse data, followed by task-specific fine-tuning. In reinforcement learning (RL), this translates to learning via offline RL on a static dataset, followed by rapid online RL fine-tuning using autonomous interaction data. Most RL fine-tuning methods require continued training on offline data for stability and performance. This is undesirable because retaining offline data is both slow and expensive for large datasets, but has been inevitable so far. In this paper, we show that retaining offline data is completely unnecessary as long as we use a correctly-designed online RL approach for fine-tuning offline RL initializations. We start by analyzing the role of retaining offline data in online fine-tuning. We find that continued training on offline data is mostly useful for preventing a sudden unlearning of the offline RL value function at the onset of fine-tuning, caused by a distribution mismatch between the offline data and online rollouts. As a result, this unlearning erases the benefits of offline pre-training. Our approach, WSRL, mitigates this sudden unlearning by using a warmup phase that seeds the online RL run with a very small number of rollouts from the pre-trained policy. The data collected during warmup helps "recalibrate" the offline Q-function to the online data, allowing us to completely discard offline data without the risk of destabilizing online RL training. We show that WSRL is able to fine-tune without retaining any offline data, learns faster, and attains higher performance than existing algorithms irrespective of whether they do or do not retain offline data.

627Replay concurrently or sequentially? A theoretical perspective on replay in continual learning

[openreview] [pdf]

Abstract Replay-based methods have shown superior performance to address catastrophic forgetting in continual learning (CL), where a subset of past data is stored and generally replayed together with new data in current task learning. While seemingly natural, it is questionable, though rarely questioned, if such a concurrent replay strategy is always the right way for replay in CL. Inspired by the fact in human learning that revisiting very different courses sequentially before final exams is more effective for students, an interesting open question to ask is whether a sequential replay can benefit CL more compared to a standard concurrent replay. However, answering this question is highly nontrivial considering a major lack of theoretical understanding in replay-based CL methods. To this end, we investigate CL in overparameterized linear models and provide a comprehensive theoretical analysis to compare two replay schemes: 1) Concurrent Replay, where the model is trained on replay data and new data concurrently; 2) Sequential Replay, where the model is trained first on new data and then sequentially on replay data for each old task. By characterizing the explicit form of forgetting and generalization error, we show in theory that sequential replay tends to outperform concurrent replay when tasks are less similar, which is corroborated by our simulations in linear models. More importantly, our results inspire a novel design of a hybrid replay method, where only replay data of similar tasks are used concurrently with the current data and dissimilar tasks are sequentially revisited using their replay data. As depicted in our experiments on real datasets using deep neural networks, such a hybrid replay method improves the performance of standard concurrent replay by leveraging sequential replay for dissimilar tasks. By providing the first comprehensive theoretical analysis on replay, our work has great potential to open up more principled designs for replay-based CL.

628Invariance to Planning in Goal-Conditioned RL

[openreview] [pdf]

Abstract We study goal-conditioned RL through the lens of generalization, but not in the traditional sense of random augmentations and domain randomization. Rather, we aim to learn goal-directed policies that generalize with respect to the horizon: after training to reach nearby goals (which are easy to learn), these policies should succeed in reaching distant goals (which are quite challenging to learn). In the same way that invariance is closely linked with generalization in other areas of machine learning (e.g., normalization layers make a network invariant to scale, and therefore generalize to inputs of varying scales), we show that this notion of horizon generalization is closely linked with invariance to planning: a policy navigating towards a goal will select the same actions as if it were navigating to a waypoint en route to that goal. Horizon generalization and invariance to planning are appealing because of their potential reach: they imply that a policy trained to reach nearby goals would succeed at reaching goals that are arbitrarily more distant. Our theoretical analysis proves that both horizon generalization and planning invariance are possible, under some assumptions. We present new experimental results, as well as recalling results from prior work, in support of our theoretical results. Taken together, our results open the door to studying how techniques for invariance and generalization developed in other areas of machine learning might be adapted to achieve this alluring property.

629FairDropout: Using Example-Tied Dropout to Enhance Generalization for Minority Groups

[openreview] [pdf]

Abstract Deep learning models frequently exploit spurious features in training data to achieve low training error, often resulting in poor generalization when faced with shifted testing distributions. To address this issue, various methods from imbalanced learning, representation learning, and classifier recalibration have been proposed to enhance the robustness of deep neural networks against spurious correlations. In this paper, we observe that models trained with empirical risk minimization tend to generalize well for examples from the majority groups while memorizing instances from minority groups. Building on recent findings that show memorization can be localized to a limited number of neurons, we apply example-tied dropout as a method we term FairDropout, aimed at redirecting this memorization to specific neurons that we subsequently drop out during inference. We empirically evaluate FairDropout using the subpopulation benchmark suite encompassing vision, language, and healthcare tasks, demonstrating that it significantly reduces reliance on spurious correlations.
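
A minimal PyTorch sketch of example-tied dropout as described: a block of "memorization" channels is only active for its own training examples and is dropped entirely at inference. The tied-channel fraction and the id-based slot assignment are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ExampleTiedDropout(nn.Module):
    def __init__(self, dim, tied_frac=0.2):
        super().__init__()
        self.n_tied = int(dim * tied_frac)  # channels reserved for memorization

    def forward(self, h, example_ids=None):
        mask = torch.ones_like(h)
        if self.training and example_ids is not None:
            # Each example activates only its own slot among the tied channels.
            slot = example_ids % self.n_tied
            tied = torch.zeros(h.size(0), self.n_tied, device=h.device)
            tied[torch.arange(h.size(0)), slot] = 1.0
            mask[:, :self.n_tied] = tied
        else:
            # Inference: drop the tied channels so memorized features are unused.
            mask[:, :self.n_tied] = 0.0
        return h * mask
```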

630Offline-to-Online Reinforcement Learning with Prioritized Experience Selection

[openreview] [pdf]

Abstract Offline-to-online reinforcement learning (O2O RL) offers a promising paradigm that first pre-trains an offline policy and fine-tunes it with further online interactions. Nevertheless, the distribution shift between the offline and online phase often hinders the fine-tuning performance, sometimes even incurring performance collapse. Existing methods mitigate this by enhancing training robustness with Q-ensemble, training a density ratio estimator to balance offline and online data, etc. But they often rely on components like ensembles and have higher training costs. In this paper, we address this issue by establishing a concrete performance bound for the optimal policies between two consecutive online steps. Motivated by the theoretical insight, we propose a simple yet effective fine-tuning method, Prioritized Experience Selection (PES). During the online stage, PES maintains a dynamically updated priority queue containing a portion of high-return trajectories, and only selects online samples that are close to the samples in the queue for fine-tuning. In this way, the distribution shift issue can be mitigated and the fine-tuning performance can be boosted. PES is computationally efficient and compatible with numerous approaches. Experimental results on a variety of D4RL datasets show that PES can benefit different offline and O2O RL algorithms and enhance Q-value estimates. Our code is available and will be open-source.
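
A minimal sketch of the selection mechanism as described: a bounded priority queue of high-return trajectories, with an online sample accepted for fine-tuning only when it is close to something in the queue. Euclidean distance in state space is an illustrative choice, not necessarily the paper's metric:

```python
import heapq
import numpy as np

class PrioritizedExperienceSelection:
    def __init__(self, capacity=100, dist_threshold=1.0):
        self.queue = []  # min-heap of (return, id, states); lowest return on top
        self.capacity = capacity
        self.dist_threshold = dist_threshold
        self._id = 0  # tie-breaker so heap never compares state arrays

    def add_trajectory(self, traj_return, states):
        heapq.heappush(self.queue, (traj_return, self._id, states))
        self._id += 1
        if len(self.queue) > self.capacity:
            heapq.heappop(self.queue)  # evict the lowest-return trajectory

    def select(self, state):
        """Should this online sample be used for fine-tuning?"""
        if not self.queue:
            return True
        all_states = np.concatenate([s for _, _, s in self.queue])
        return np.linalg.norm(all_states - state, axis=1).min() <= self.dist_threshold
```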

631Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

[openreview] [pdf]

Abstract Recent advances in knowledge distillation (KD) have enabled smaller student models to approach the performance of larger teacher models. However, popular methods such as supervised KD and on-policy KD are adversely impacted by the knowledge gap between teacher and student in practical scenarios. Supervised KD suffers from a distribution mismatch between training with a static dataset and inference over final student-generated outputs. Conversely, on-policy KD, which uses student-generated samples for training, can suffer from low-quality training examples with which teacher models are not familiar, resulting in inaccurate teacher feedback. To address these limitations, we introduce Speculative Knowledge Distillation (SKD), a novel approach that leverages cooperation between student and teacher models to generate high-quality training data on-the-fly while aligning with the student’s inference-time distribution. In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution, transferring high-quality knowledge adaptively. We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following, and show that SKD consistently outperforms existing KD methods across different domains, data sizes, and model initialization strategies.
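
A minimal single-position sketch of the interleaved sampling idea, assuming 1-D logits; the top-k acceptance rule below stands in for however the paper ranks student proposals:

```python
import torch

@torch.no_grad()
def skd_step(student_logits, teacher_logits, top_k=25):
    """Sample a token from the student; keep it if the teacher ranks it
    in its top-k, otherwise resample from the teacher."""
    s_tok = torch.multinomial(torch.softmax(student_logits, dim=-1), 1)
    teacher_topk = torch.topk(teacher_logits, top_k).indices
    if s_tok.item() in teacher_topk.tolist():
        return s_tok.item(), "student"   # accepted student proposal
    t_tok = torch.multinomial(torch.softmax(teacher_logits, dim=-1), 1)
    return t_tok.item(), "teacher"       # teacher replacement
```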

632PREDICT: Preference Reasoning by Evaluating Decomposed preferences Inferred from Candidate Trajectories

[openreview] [pdf]

Abstract Accommodating human preferences is essential for creating AI agents that deliver personalized and effective interactions. Recent work has shown the potential for LLMs to infer preferences from user interactions, but they often produce broad and generic preferences, failing to capture the unique and individualized nature of human preferences. This paper introduces PREDICT, a method designed to enhance the precision and adaptability of inferring preferences. PREDICT incorporates three key elements: (1) iterative refinement of inferred preferences, (2) decomposition of preferences into constituent components, and (3) validation of preferences across multiple trajectories. We evaluate PREDICT on two distinct environments: a gridworld setting and a new text-domain environment (PLUME). PREDICT more accurately infers nuanced human preferences, improving over existing baselines by 66.2% (gridworld environment) and 41.0% (PLUME).

633Online Bandit Nonlinear Control with Dynamic Batch Length and Adaptive Learning Rate

[openreview] [pdf]

Abstract This paper is concerned with online bandit nonlinear control, which aims to learn the best stabilizing controller from a pool of stabilizing and destabilizing controllers of unknown types for a given nonlinear dynamical system. We develop an algorithm, named Dynamic Batch length and Adaptive learning Rate (DBAR), and study its stability and regret. Unlike the existing Exp3 algorithm requiring an exponentially stabilizing controller, DBAR only needs a significantly weaker notion of controller stability, in which case substantial time may be required to certify the system stability. Dynamic batch length in DBAR effectively addresses this issue and enables the system to attain asymptotic stability, where the algorithm behaves as if there were no destabilizing controllers. Moreover, adaptive learning rate in DBAR only uses the state norm information to achieve a tight regret bound even when none of the stabilizing controllers in the pool are exponentially stabilizing.

634AlphaQCM: Alpha Discovery with Distributional Reinforcement Learning

[openreview] [pdf]

Abstract Finding synergistic formulaic alphas is very important but challenging for researchers and practitioners in finance. In this paper, we reconsider the discovery of formulaic alphas from the viewpoint of sequential decision-making, and conceptualize the entire alpha-mining process as a non-stationary and reward-sparse Markov decision process. To overcome the challenges of non-stationarity and reward-sparsity, we propose the AlphaQCM method, a novel distributional reinforcement learning method designed to search for synergistic formulaic alphas efficiently. The AlphaQCM method first learns the Q function and quantiles via a Q network and a quantile network, respectively. Then, the AlphaQCM method applies the quantiled conditional moment method to learn unbiased variance from the potentially biased quantiles. Guided by the learned Q function and variance, the AlphaQCM method navigates the non-stationarity and reward-sparsity to explore the vast search space of formulaic alphas with high efficacy. Empirical applications to real-world datasets demonstrate that our AlphaQCM method significantly outperforms its competitors, particularly when dealing with large datasets comprising numerous stocks.

635Federated Learning with Dynamic Client Arrival and Departure: Convergence and Rapid Adaptation via Initial Model Construction

[openreview] [pdf]

Abstract While most existing federated learning (FL) approaches assume a fixed set of clients in the system, in practice, clients can dynamically leave or join the system depending on their needs or interest in the specific task. This dynamic FL setting introduces several key challenges: (1) the objective function dynamically changes depending on the current set of clients, unlike traditional FL approaches that maintain a static optimization goal; (2) the current global model may not serve as the best initial point for the next FL rounds and could potentially lead to slow adaptation, given the possibility of clients leaving or joining the system. In this paper, we consider a dynamic optimization objective in FL that seeks the optimal model tailored to the currently active set of clients. Building on our probabilistic framework that provides direct insights into how the arrival and departure of different types of clients influence the shifts in optimal points, we establish an upper bound on the optimality gap, accounting for factors such as stochastic gradient noise, local training iterations, non-IIDness of data distribution, and deviations between optimal points caused by dynamic client patterns. We also propose an adaptive initial model construction strategy that employs weighted averaging guided by gradient similarity, prioritizing models trained on clients whose data characteristics align closely with the current one, thereby enhancing adaptability to the current clients. The proposed approach is validated on various datasets and FL algorithms, demonstrating robust performance across diverse client arrival and departure patterns, underscoring its effectiveness in dynamic FL environments.
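
A minimal sketch of the weighted initial-model construction as described, using cosine similarity between flattened client updates; the softmax temperature and the flattened-parameter representation are assumptions:

```python
import torch
import torch.nn.functional as F

def adaptive_initial_model(stored_updates, stored_models, current_update, temp=1.0):
    """Weight previously trained client models by how similar their updates
    are to the current client's update, then average."""
    sims = torch.stack([
        F.cosine_similarity(u.flatten(), current_update.flatten(), dim=0)
        for u in stored_updates
    ])
    w = torch.softmax(sims / temp, dim=0)
    # Weighted average of flattened model parameter vectors.
    return sum(wi * m for wi, m in zip(w, stored_models))
```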

636Beyond the Boundaries of Proximal Policy Optimization

[openreview] [pdf]

Abstract Proximal policy optimization (PPO) is a widely-used algorithm for on-policy reinforcement learning. This work offers an alternative perspective of PPO, in which it is decomposed into the inner-loop estimation of update vectors, and the outer-loop application of updates using gradient ascent with unity learning rate. Using this insight we propose outer proximal policy optimization (outer-PPO); a framework wherein these update vectors are applied using an arbitrary gradient-based optimizer. The decoupling of update estimation and update application enabled by outer-PPO highlights several implicit design choices in PPO that we challenge through empirical investigation. In particular we consider non-unity learning rates and momentum applied to the outer loop, and a momentum-bias applied to the inner estimation loop. Methods are evaluated against an aggressively tuned PPO baseline on Brax, Jumanji and MinAtar environments; non-unity learning rates and momentum both achieve statistically significant improvement on Brax and Jumanji, given the same hyperparameter tuning budget.
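
A minimal PyTorch sketch of the decomposition: the inner loop estimates an update vector, which the outer loop applies through an arbitrary optimizer. With plain SGD at unity learning rate and no momentum this reduces to vanilla PPO; `inner_update_fn` is an assumed callable running standard PPO epochs in place:

```python
import torch

def outer_ppo_step(policy, inner_update_fn, outer_opt):
    theta_old = [p.detach().clone() for p in policy.parameters()]
    inner_update_fn(policy)                  # standard PPO epochs on a copy of theta
    outer_opt.zero_grad()
    for p, old in zip(policy.parameters(), theta_old):
        update = p.detach() - old            # the estimated update vector
        p.data.copy_(old)                    # rewind to theta_old
        p.grad = -update                     # ascent direction as a "gradient"
    outer_opt.step()                         # e.g. torch.optim.SGD(lr=1.2, momentum=0.9)
```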

637Foundation Models for Enhanced Exploration in Reinforcement Learning

[openreview] [pdf]

Abstract Reinforcement learning agents often struggle with sample inefficiency, requiring extensive interactions with the environment to develop effective policies. This inefficiency is partly due to the challenge of balancing exploration and exploitation without the abstract reasoning and prior knowledge that humans use to quickly identify rewarding actions. Recent advancements in foundation models, such as large language models (LLMs) and vision-language models (VLMs), have shown human-level reasoning capabilities in some domains but have been underutilized in directly selecting low-level actions for exploration in reinforcement learning. In this paper, we investigate the potential of foundation models to enhance exploration in reinforcement learning tasks. We conduct an in-depth analysis of their exploration behaviour in multi-armed bandit problems and Gridworld environments, comparing their performance against traditional exploration strategies and reinforcement learning agents. Our empirical results suggest foundation models can significantly improve exploration efficiency by leveraging their reasoning abilities to infer optimal actions. Building on these findings, we introduce Foundation Model Exploration (FME), a novel exploration scheme that integrates foundation models into the reinforcement learning framework for intelligent exploration behaviour. We use VLMs and demonstrate that they can infer environment dynamics and objectives from raw image observations. This means FME only requires the action space as environment-specific manual text input. We find that agents equipped with FME achieve superior performance in sparse reward Gridworld environments and scale to more complex tasks like Atari games. Moreover, the effectiveness of FME increases with the capacity of the VLM used, indicating that future advancements in foundation models will further enhance such exploration strategies.

638Is multitask learning all you need in continual learning?

[openreview] [pdf]

Abstract Continual Learning solutions often treat multitask learning as an upper-bound of what the learning process can achieve. This is a natural assumption, given that this objective directly addresses the catastrophic forgetting problem, which has been a central focus in early works. However, depending on the nature of the distributional shift in the data, the multi-task solution is not always optimal for the broader continual learning problem. In this work, we draw on principles from online learning to formalize the limitations of multitask objectives, especially when viewed through the lens of cumulative loss, which also serves as an indicator of forward transfer. We provide empirical evidence on when multi-task solutions are suboptimal, and argue that continual learning solutions should not and do not have to adhere to this assumption. Moreover, we argue for the utility of estimating the distributional drift as the data is being received and show preliminary results of how this could be exploited by a simple replay based method to move beyond the multitask solution.

639Contextual Bandits with Entropy-based Human Feedback

[openreview] [pdf]

Abstract In recent years, preference-based human feedback mechanisms have become integral to improving model performance across a range of applications, including conversational AI systems like ChatGPT. However, existing methodologies often overlook critical factors such as model uncertainty and variability in feedback quality. To address these limitations, we propose an innovative entropy-based human feedback framework designed for contextual bandits, which balances exploration and exploitation by soliciting expert feedback when model entropy surpasses a predefined threshold. Our method is model-agnostic and adaptable to any contextual bandit agent employing stochastic policies. Through rigorous experimentation, we demonstrate that our approach requires minimal human feedback to achieve significant performance gains, even with suboptimal feedback quality. Our work not only introduces a novel feedback solicitation strategy but also underscores the robustness of integrating human guidance into machine learning systems. Our code is publicly available: \url{https://anonymous.4open.science/r/CBHF-33C5}
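
A minimal sketch of the entropy-gated solicitation rule; the normalized-entropy threshold and the expert interface are illustrative assumptions:

```python
import numpy as np

def maybe_query_expert(action_probs, expert_fn, threshold=0.5):
    """Ask the expert when the agent's action distribution is too uncertain;
    otherwise act greedily. `expert_fn` returns the expert's action."""
    p = np.asarray(action_probs)
    entropy = -(p * np.log(p + 1e-12)).sum() / np.log(len(p))  # normalized to [0, 1]
    if entropy > threshold:
        return expert_fn(), True      # solicited expert feedback
    return int(p.argmax()), False     # act on the agent's own policy
```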

640Preference Optimization for Reasoning with Pseudo Feedback

[openreview] [pdf]

Abstract Preference optimization techniques, such as Direct Preference Optimization (DPO), are frequently employed to enhance the reasoning capabilities of large language models (LLMs) in domains like mathematical reasoning and coding, typically following supervised fine-tuning. These methods rely on high-quality labels for reasoning tasks to generate preference pairs; however, the availability of reasoning datasets with human-verified labels is limited. In this study, we introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions to reasoning problems as an evaluation against associated test cases. We explore two forms of pseudo feedback based on test cases: one generated by frontier LLMs and the other by extending self-consistency to multiple test cases. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks. Specifically, using Mathstral-7B as our base model, we improve MATH results from 58.3 to 68.6, surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview. In GSM8K and College Math, our scores increase from 85.6 to 90.3 and from 34.3 to 42.3, respectively. Building on Deepseek-coder-7B-v1.5, we achieve a score of 24.3 on LiveCodeBench (from 21.1), surpassing Claude-3-Haiku.
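
A minimal sketch of turning test-case execution into preference pairs; the executor interface `run_fn(solution, case) -> bool` and the pass-rate scoring are assumptions standing in for the paper's exact labeling scheme:

```python
def pseudo_feedback_pairs(candidates, test_cases, run_fn):
    """Score each candidate solution by the fraction of test cases it passes,
    then pair higher scorers ("chosen") against lower scorers ("rejected")."""
    scored = sorted(
        ((sum(run_fn(c, t) for t in test_cases) / len(test_cases), c)
         for c in candidates),
        reverse=True, key=lambda x: x[0],
    )
    return [
        {"chosen": hi, "rejected": lo}
        for (s_hi, hi) in scored for (s_lo, lo) in scored if s_hi > s_lo
    ]
```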

641Differential Transformer

[openreview] [pdf]

Abstract Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which was considered a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture for large language models.
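
A minimal single-head sketch of the differential attention map, with a fixed `lam` standing in for the paper's learnable balance scalar:

```python
import torch
import torch.nn.functional as F

def differential_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.8):
    """Compute two softmax attention maps from separate query/key projections
    and subtract them before applying the result to the values."""
    d = Wq1.shape[1]
    a1 = F.softmax((x @ Wq1) @ (x @ Wk1).transpose(-1, -2) / d**0.5, dim=-1)
    a2 = F.softmax((x @ Wq2) @ (x @ Wk2).transpose(-1, -2) / d**0.5, dim=-1)
    return (a1 - lam * a2) @ (x @ Wv)   # noise-cancelling attention map
```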

642Mobility Networked Time-Series Forecasting Benchmark Datasets

[openreview] [pdf]

Abstract Human mobility is crucial for urban planning (e.g., public transportation) and epidemic response strategies. However, existing research often neglects integrating comprehensive perspectives on spatial dynamics, temporal trends, and other contextual views due to the limitations of existing mobility datasets. To bridge this gap, we introduce MOBINS (MOBIlity Networked time Series), a novel dataset collection designed for networked time-series forecasting of dynamic human movements. MOBINS features diverse and explainable datasets that capture various mobility patterns across different transportation modes in four cities and two countries and cover both transportation and epidemic domains at the administrative area level. Our experiments with nine baseline methods reveal the significant impact of different model backbones on the proposed six datasets. We provide a valuable resource for advancing urban mobility research, and our dataset collection is available at https://anonymous.4open.science/r/MOBINS.

643Avoiding Catastrophe in Online Learning by Asking for Help

[openreview] [pdf]

Abstract Most learning algorithms with formal regret guarantees assume that no mistake is irreparable and essentially rely on trying all possible behaviors. This approach is problematic when some mistakes are catastrophic, i.e., irreparable. We propose an online learning problem where the goal is to minimize the chance of catastrophe. Specifically, we assume that the payoff in each round represents the chance of avoiding catastrophe that round and try to maximize the product of payoffs (the overall chance of avoiding catastrophe) while allowing a limited number of queries to a mentor. We first show that in general, any algorithm either constantly queries the mentor or is nearly guaranteed to cause catastrophe. However, in settings where the mentor policy class is learnable in the standard online model, we provide an algorithm whose regret and rate of querying the mentor both approach 0 as the time horizon grows. Conceptually, if a policy class is learnable in the absence of catastrophic risk, it is learnable in the presence of catastrophic risk if the agent can ask for help.

644Convergence of Distributed Adaptive Optimization with Local Updates

[openreview] [pdf]

Abstract We study distributed adaptive algorithms with local updates (intermittent communication). Despite the great empirical success of adaptive methods in distributed training of modern machine learning models, the theoretical benefits of local updates within adaptive methods, particularly in terms of reducing communication complexity, have not been fully understood yet. In this paper, we prove that Local SGD with momentum (Local SGDM) and Local Adam can outperform their minibatch counterparts in convex and weakly convex settings, respectively. Our analysis relies on a novel technique to prove contraction during local iterations, which is a crucial yet challenging step in showing the advantages of local updates, under a generalized smoothness assumption and a gradient clipping strategy.

645Linear Multistep Solver Distillation for Fast Sampling of Diffusion Models

[openreview] [pdf]

Abstract Sampling from diffusion models can be seen as solving the corresponding probability flow ordinary differential equation (ODE). The solving process requires a significant number of function evaluations (NFE), making it time-consuming. Recently, several solver search frameworks have attempted to find better-performing model-specific solvers. However, predicting the impact of intermediate solving strategies on final sample quality remains challenging, rendering the search process inefficient. In this paper, we propose a novel method for designing solving strategies. We first introduce a unified prediction formula for linear multistep solvers. Subsequently, we present a solver distillation framework, which enables a student solver to mimic the sampling trajectory generated by a teacher solver with more steps. We utilize the mean Euclidean distance between the student and teacher sampling trajectories as a metric, facilitating rapid adjustment and optimization of intermediate solving strategies. The design space of our framework encompasses multiple aspects, including prediction coefficients, time step schedules, and time scaling factors. Our framework can complete a solver search for Stable-Diffusion in less than 10 total GPU hours. Compared to previous reinforcement learning-based search frameworks, our approach achieves over a 10× increase in search efficiency. With just 5 NFE, we achieve FID scores of 3.23 on CIFAR10, 7.16 on ImageNet-64, 5.44 on LSUN-Bedroom, and 15.69 on MS-COCO, resulting in a 2× sampling acceleration ratio compared to handcrafted solvers.
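A rough sketch of the two ingredients the abstract names, under stated assumptions: a linear multistep update whose coefficients form part of the search space, and the trajectory-matching distillation loss. The function names and exact parameterization here are hypothetical simplifications.

```python
import torch

def multistep_predict(x_t, eps_history, coeffs, dt):
    """Linear multistep update: extrapolate the next latent from a learned
    linear combination of past model outputs (coeffs are searched/distilled)."""
    eps = sum(c * e for c, e in zip(coeffs, eps_history))
    return x_t + dt * eps

def distillation_loss(student_traj, teacher_traj):
    """Mean Euclidean distance between the few-step student trajectory and
    the teacher trajectory subsampled at matching time points."""
    dists = [(s - t).flatten(1).norm(dim=1).mean()
             for s, t in zip(student_traj, teacher_traj)]
    return torch.stack(dists).mean()
```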

646Distributed In-Context Learning under Non-IID Among Clients

[openreview] [pdf]

Abstract Advancements in large language models (LLMs) have shown their effectiveness in multiple complicated natural language reasoning tasks. A key challenge remains in adapting these models efficiently to new or unfamiliar tasks. In-context learning (ICL) provides a promising solution for few-shot adaptation by retrieving a set of data points relevant to a query, called in-context examples (ICE), from a training dataset and providing them during the inference as context. Most existing studies utilize a centralized training dataset, yet many real-world datasets may be distributed among multiple clients, and remote data retrieval can be associated with costs. Especially when the client data follow non-identical distributions (non-IID), retrieving from the clients a proper set of ICEs for a test query presents critical challenges. In this paper, we first show that in this challenging setting, test queries will have different preferences among clients because of non-IIDness, and equal contribution often leads to suboptimal performance. We then introduce a novel approach to tackle the distributed non-IID ICL problem when a data usage budget is present. The principle is that each client’s proper contribution (budget) should be designed according to the preference of each query for that client. Our approach allocates a budget for each client in a data-driven manner, tailored to each test query. Through extensive empirical studies on diverse datasets, our framework demonstrates superior performance relative to competing baselines.

647Accelerated Online Reinforcement Learning using Auxiliary Start State Distributions

[openreview] [pdf]

Abstract Learning a robust policy that is performant across the state space, in a sample efficient manner, is a long-standing problem in online reinforcement learning (RL). This challenge arises from the inability of algorithms to explore the environment efficiently. Most attempts at efficient exploration tackle this problem in a setting where learning begins from scratch, without prior information available to bootstrap learning. However, such approaches often fail to fully leverage expert demonstrations and simulators that can reset to arbitrary states. These affordances are valuable resources that offer enormous potential to guide exploration and speed up learning. In this paper, we explore how a small number of expert demonstrations and a simulator allowing arbitrary resets can accelerate learning during online RL. We show that by leveraging expert state information to form an auxiliary start state distribution, we significantly improve sample efficiency. Specifically, we show that using a notion of safety to inform the choice of auxiliary distribution significantly accelerates learning. We highlight the effectiveness of our approach by matching or exceeding state-of-the-art performance in sparse reward and dense reward setups, even when competing with algorithms with access to expert actions and rewards. Moreover, we find that the improved exploration ability facilitates learning more robust policies in sparse reward, hard exploration environments.

648Towards Marginal Fairness Sliced Wasserstein Barycenter

[openreview] [pdf]

Abstract The Sliced Wasserstein barycenter (SWB) is a widely acknowledged method for efficiently generalizing the averaging operation within probability measure spaces. However, achieving marginal fairness SWB, ensuring approximately equal distances from the barycenter to marginals, remains unexplored. The uniform weighted SWB is not necessarily the optimal choice to obtain the desired marginal fairness barycenter due to the heterogeneous structure of marginals and the non-optimality of the optimization. As the first attempt to tackle the problem, we define the marginal fairness sliced Wasserstein barycenter (MFSWB) as a constrained SWB problem. Due to the computational disadvantages of the formal definition, we propose two hyperparameter-free and computationally tractable surrogate MFSWB problems that implicitly minimize the distances to marginals and encourage marginal fairness at the same time. To further improve the efficiency, we perform slicing distribution selection and obtain the third surrogate definition by introducing a new slicing distribution that focuses more on marginally unfair projecting directions. We discuss the relationships among the three proposed problems and their connection to the sliced multi-marginal Wasserstein distance. Finally, we conduct experiments on 3D point-cloud averaging, color harmonization, and training of sliced Wasserstein autoencoders with class-fairness representation to show the favorable performance of the proposed surrogate MFSWB problems.
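The surrogates build on the sliced Wasserstein distance, which is cheap to estimate by Monte Carlo over random one-dimensional projections. A minimal sketch for equal-size point clouds follows; the fairness-aware slicing distribution the paper introduces is not reproduced here.

```python
import torch

def sliced_wasserstein(x, y, n_proj=128, p=2):
    """Monte Carlo sliced Wasserstein distance between point clouds x, y.

    x, y: (n, d) tensors with the same n; directions are uniform on the sphere.
    """
    d = x.shape[1]
    theta = torch.randn(n_proj, d)
    theta = theta / theta.norm(dim=1, keepdim=True)   # random projection directions
    xp = torch.sort(x @ theta.T, dim=0).values        # sorted 1D projections
    yp = torch.sort(y @ theta.T, dim=0).values        # = 1D optimal transport plan
    return ((xp - yp).abs() ** p).mean() ** (1 / p)
```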

649On Generalization Within Multi-Objective Reinforcement Learning Algorithms

[openreview] [pdf]

Abstract Real-world sequential decision-making tasks often require balancing trade-offs between multiple conflicting objectives, making Multi-Objective Reinforcement Learning (MORL) an increasingly prominent field of research. Despite recent advances, existing MORL literature has narrowly focused on performance within static environments, neglecting the importance of generalizing across diverse settings. Conversely, existing research on generalization in RL has always assumed scalar rewards, overlooking the inherent multi-objectivity of real-world problems. Generalization in the multi-objective context is fundamentally more challenging, as it requires learning a Pareto set of policies addressing varying preferences across multiple objectives. In this paper, we formalize the concept of generalization in MORL and how it can be evaluated. We then contribute a novel testbed featuring diverse multi-objective domains with parameterized environment configurations to facilitate future studies in this area. Our baseline evaluations of state-of-the-art MORL algorithms on this testbed reveal limited generalization capabilities, suggesting significant room for improvement. Our empirical findings also expose limitations in the expressivity of scalar rewards, emphasizing the need for multi-objective specifications to achieve effective generalization. We further analyze the algorithmic complexities within current MORL approaches that could impede the transfer in performance from the single- to multiple-environment settings. This work fills a critical gap and lays the groundwork for future research that brings together two key areas in reinforcement learning: solving multi-objective decision-making problems and generalizing across diverse environments.

650Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization

[openreview] [pdf]

Abstract Recent advancements in timestep-distilled diffusion models have enabled high-quality image generation that rivals non-distilled multi-step models, but with significantly fewer inference steps. While such models are attractive for applications due to the low inference cost and latency, fine-tuning them with a naive diffusion objective would result in degraded and blurry outputs. An intuitive alternative is to repeat the diffusion distillation process with a fine-tuned teacher model, which produces good results but is cumbersome and computationally intensive: the distillation training usually requires orders of magnitude more training compute than fine-tuning for specific image styles. In this paper, we present an algorithm named pairwise sample optimization (PSO), which enables the direct fine-tuning of an arbitrary timestep-distilled diffusion model. PSO introduces additional reference images sampled from the current time-step distilled model, and increases the relative likelihood margin between the training images and reference images. This enables the model to retain its few-step generation ability, while allowing for fine-tuning of its output distribution. We also demonstrate that PSO is a generalized formulation which can be flexibly extended to both offline-sampled and online-sampled pairwise data, covering various popular objectives for diffusion model preference optimization. We evaluate PSO in both preference optimization and other fine-tuning tasks, including style transfer and concept customization. We show that PSO can directly adapt distilled models to human-preferred generation with both offline and online-generated pairwise preference image data. PSO also demonstrates effectiveness in style transfer and concept customization by directly tuning timestep-distilled diffusion models.
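One way to read the pairwise-margin idea is as a DPO-style logistic objective on per-sample diffusion losses, comparing target training images against reference images sampled from the current distilled model, each measured relative to a frozen copy of the starting model. The sketch below is an interpretation under that assumption, not the paper's exact objective.

```python
import torch.nn.functional as F

def pso_pairwise_loss(loss_train, loss_ref, loss_train_frozen, loss_ref_frozen, beta=1.0):
    """Sketch of a pairwise-margin objective: push the fine-tuned model's
    denoising loss down on training images and up on self-generated reference
    images, relative to the frozen starting model.

    All arguments are per-sample diffusion losses of shape (batch,).
    """
    margin = (loss_train_frozen - loss_train) - (loss_ref_frozen - loss_ref)
    return -F.logsigmoid(beta * margin).mean()
```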

651Learn Your Reference Model for Real Good Alignment

[openreview] [pdf]

Abstract Despite the fact that offline methods for Large Language Model (LLM) alignment do not require a direct reward model, they remain susceptible to overoptimization. This issue arises when the trained model deviates excessively from the reference policy, leading to a decrease in sample quality. We propose a new paradigm of offline alignment methods, called Trust Region (including variants TR-DPO, TR-IPO, TR-KTO), which dynamically updates the reference policy throughout the training process. Our results show that TR alignment methods effectively mitigate overoptimization, enabling models to maintain strong performance even when substantially deviating from the initial reference policy. We demonstrate the efficacy of these approaches not only through toy examples that exhibit reduced overoptimization, but also through direct, side-by-side comparisons in specific tasks such as helpful and harmless dialogue, as well as summarization, where they surpass conventional methods. Additionally, we report significant improvements in general-purpose assistant setups with the Llama3 model on the AlpacaEval 2 and Arena-Hard benchmarks, highlighting the advantages of Trust Region methods over classical approaches.
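The distinguishing mechanism is refreshing the reference policy during training rather than freezing it. A minimal sketch of a soft (EMA-style) refresh is below; a hard-update variant would instead copy the trained weights into the reference every fixed number of steps. The update rate `alpha` is an illustrative hyperparameter.

```python
import torch

@torch.no_grad()
def soft_update_reference(policy, ref_policy, alpha=0.01):
    """Soft refresh of the reference policy toward the trained policy,
    one plausible instantiation of a dynamically updated reference."""
    for p, r in zip(policy.parameters(), ref_policy.parameters()):
        r.mul_(1 - alpha).add_(alpha * p)  # r <- (1 - alpha) * r + alpha * p
```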

652Reward Dimension Reduction for Scalable Multi-Objective Reinforcement Learning

[openreview] [pdf]

Abstract In this paper, we introduce a simple yet effective reward dimension reduction method to tackle the scalability challenges of multi-objective reinforcement learning algorithms. While most existing approaches focus on optimizing two to four objectives, their ability to scale to environments with more objectives remains uncertain. Our method uses a dimension reduction approach to enhance learning efficiency and policy performance in multi-objective settings. While most traditional dimension reduction methods are designed for static datasets, our approach is tailored for online learning and preserves Pareto-optimality after transformation. We propose a new training and evaluation framework for reward dimension reduction in multi-objective reinforcement learning and demonstrate the superiority of our method in an environment with sixteen objectives, significantly outperforming existing online dimension reduction methods.

653Generalized Anomaly Detection with Knowledge Exposure: The Dual Effects of Augmentation

[openreview] [pdf]

Abstract Anomaly detection involves identifying samples that deviate from the training data. While previous methods have demonstrated significant performance, our experiments reveal that their generalization ability declines substantially when faced with slight shifts in the test data. This limitation stems from an underlying assumption: these methods generally expect the distribution of normal test samples to closely resemble that of the training set, while anomalies are presumed to be far from this distribution. However, in real-world scenarios, test samples often experience varying degrees of distributional shift while retaining their semantic consistency. The ability to generalize to semantics-preserving transformations, while accurately flagging as anomalies those samples whose semantic meaning has changed, is critical for a model’s trustworthiness and reliability. For instance, while a rotation may alter the semantic meaning of a car in the context of anomaly detection, it typically preserves the meaning of an apple. Yet, current methods, particularly those based on contrastive learning, are likely to detect both as anomalies. This complexity underscores the need for dynamic learning procedures grounded in a deeper understanding of outliers. To address this, we propose a novel approach called Knowledge Exposure (KE), which incorporates external knowledge to interpret concept dynamics and distinguish between transformations that induce semantic shifts. Our approach improves generalization by leveraging insights from a pre-trained CLIP model to assess the significance of anomalies for each concept. Evaluations on datasets such as CIFAR-10, CIFAR-100, and SVHN demonstrate superior performance compared to previous methods, validating the effectiveness of our approach.

654Strategic Classification With Externalities

[openreview] [pdf]

Abstract We propose a new variant of the strategic classification problem: a principal reveals a classifier, and $n$ agents report their (possibly manipulated) features to be classified. Motivated by real-world applications, our model crucially allows the manipulation of one agent to affect another; that is, it explicitly captures inter-agent externalities. The principal-agent interactions are formally modeled as a Stackelberg game, with the resulting agent manipulation dynamics captured as a simultaneous game. We show that under certain assumptions, the pure Nash Equilibrium of this agent manipulation game is unique and can be efficiently computed. Leveraging this result, PAC learning guarantees are established for the learner: informally, we show that it is possible to learn classifiers that minimize loss on the distribution, even when a random number of agents are manipulating their way to a pure Nash Equilibrium. We also comment on the optimization of such classifiers through gradient-based approaches. This work sets the theoretical foundations for a more realistic analysis of classifiers that are robust against multiple strategic actors interacting in a common environment.
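Since the agent manipulation game has a unique pure Nash equilibrium under the stated assumptions, a natural way to compute it is simultaneous best-response iteration until a fixed point. The sketch below assumes a user-supplied `best_response` oracle (a hypothetical interface; the paper's efficient computation may differ).

```python
import numpy as np

def best_response_dynamics(x0, best_response, max_iter=1000, tol=1e-8):
    """Iterate simultaneous best responses until a fixed point (a pure NE).

    x0: (n, d) initial reported features; best_response(i, x) returns agent
    i's optimal report given everyone else's current reports x.
    """
    x = x0.copy()
    for _ in range(max_iter):
        x_new = np.stack([best_response(i, x) for i in range(len(x))])
        if np.linalg.norm(x_new - x) < tol:   # no agent wants to deviate
            return x_new
        x = x_new
    return x
```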

655Mitigating Dialogue Hallucination for Large Vision Language Models via Adversarial Instruction Tuning

[openreview] [pdf]

Abstract Mitigating hallucinations of Large Vision Language Models (LVLMs) is crucial to enhance their reliability for general-purpose assistants. This paper shows that such hallucinations of LVLMs can be significantly exacerbated by preceding user-system dialogues. To precisely measure this, we first present an evaluation benchmark by extending popular multi-modal benchmark datasets with prepended hallucinatory dialogues powered by our novel Adversarial Question Generator (AQG), which can automatically generate image-related yet adversarial dialogues by adopting adversarial attacks on LVLMs. On our benchmark, the zero-shot performance of state-of-the-art LVLMs drops significantly for both the VQA and Captioning tasks. Next, we further reveal this hallucination is mainly due to the prediction bias toward preceding dialogues rather than visual content. To reduce this bias, we propose Adversarial Instruction Tuning (AIT) that robustly fine-tunes LVLMs against hallucinatory dialogues. Extensive experiments show our proposed approach successfully reduces dialogue hallucination while maintaining performance.

656Concept-driven Off Policy Evaluation

[openreview] [pdf]

Abstract Evaluating a set of decisions from batch data, as in off-policy evaluation (OPE), is challenging: high variance and limited sample sizes can severely hinder reliable evaluation. Identifying and addressing the sources of this variance is essential for improving OPE performance. Recent work on Concept Bottleneck Models (CBMs) shows how a set of human-explainable concepts can be used for predictions, enabling clearer understanding and inspection of these models. Our work proposes incorporating concepts into OPE to identify and reduce variance through targeted interventions. For example, concepts such as shared disease characteristics could help predict better treatments, despite differing vital signs among two patients. We introduce a family of concept-based OPE estimators, and provide theoretical guarantees that when given a set of known concepts, these estimators are unbiased and reduce variance compared to traditional methods. However, in many real-world applications, these concepts are often unknown and need to be estimated. We develop an end-to-end algorithm for learning parameterized concepts that are interpretable, concise, diverse, and optimized for variance reduction in OPE. Through extensive experiments on synthetic and real-world datasets, we demonstrate that both known and learned concept-based estimators significantly improve OPE performance. Crucially, we show that unlike other methods for OPE, concept-based estimators can easily be interpreted and offer opportunities for targeted interventions on specific concepts of interest to further improve the quality of these estimators.
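As a rough illustration of how concepts can enter an OPE estimator, the sketch below marginalizes the importance weights onto a low-dimensional concept `c = concept_fn(x)`, which can reduce variance when concepts suffice to predict behavior. This is an interpretation of the idea, with hypothetical callables, not the paper's estimator family.

```python
import numpy as np

def concept_ips(rewards, actions, contexts, concept_fn, pi_e, pi_b):
    """Importance-sampling OPE estimate with policies marginalized onto
    concepts. pi_e(a, c) / pi_b(a, c) are evaluation/behavior probabilities
    conditioned on the concept rather than the full context.
    """
    c = [concept_fn(x) for x in contexts]
    w = np.array([pi_e(a, ci) / pi_b(a, ci) for a, ci in zip(actions, c)])
    return float((w * np.asarray(rewards)).mean())
```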

657FreeVS: Generative View Synthesis on Free Driving Trajectory

[openreview] [pdf]

Abstract Existing reconstruction-based novel view synthesis methods for driving scenes focus on synthesizing camera views along the recorded trajectory of the ego vehicle. Their image rendering performance will severely degrade on viewpoints falling out of the recorded trajectory, where camera rays are untrained. We propose FreeVS, a novel fully generative approach that can synthesize camera views on free new trajectories in real driving scenes. To control the generation results to be 3D consistent with the real scenes and accurate in viewpoint pose, we propose the pseudo-image representation of view priors to control the generation process. Viewpoint translation simulation is applied on pseudo-images to simulate camera movement in each direction. Once trained, FreeVS can be applied to any validation sequence without a reconstruction process and can synthesize views on novel trajectories. Moreover, we propose two new challenging benchmarks tailored to driving scenes, which are novel camera synthesis and novel trajectory synthesis, emphasizing the freedom of viewpoints. Given that no ground truth images are available on novel trajectories, we also propose to evaluate the consistency of images synthesized on novel trajectories with 3D perception models. Experiments on the Waymo Open Dataset show that FreeVS has a strong image synthesis performance on both the recorded trajectories and novel trajectories. The code will be released.

658A Single Goal is All You Need: Skills and Exploration Emerge from Contrastive RL without Rewards, Demonstrations, or Subgoals

[openreview] [pdf]

Abstract In this paper, we present empirical evidence of skills and directed exploration emerging from a simple RL algorithm long before any successful trials are observed. For example, in a manipulation task, the agent is given a single observation of the goal state (see Fig. 1) and learns skills, first for moving its end-effector, then for pushing the block, and finally for picking up and placing the block. These skills emerge before the agent has ever successfully placed the block at the goal location and without the aid of any reward functions, demonstrations, or manually-specified distance metrics. Once the agent has learned to reach the goal state reliably, exploration is reduced. Implementing our method involves a simple modification of prior work and does not require density estimates, ensembles, or any additional hyperparameters. Intuitively, the proposed method seems like it should be terrible at exploration, and we lack a clear theoretical understanding of why it works so effectively, though our experiments provide some hints.

659DIMS: Channel-Dependent and Seasonal-Trend Independent Transformer Using Multi-Stage Training for Time Series Forecasting

[openreview] [pdf]

Abstract Due to the limited size of real-world time series data, current transformer-based time series forecasting algorithms often struggle with overfitting. Common techniques used to mitigate overfitting include channel independence and seasonal-trend decomposition. However, channel independence inevitably results in the loss of inter-channel dependencies, and existing seasonal-trend decomposition methods are insufficient in effectively mitigating overfitting. In this study, we propose DIMS, a time series forecasting model that uses multi-stage training to capture inter-channel dependencies while ensuring the independence of seasonal and trend components. The computation of channel dependency is postponed to the later stage, following the channel-independent training, while the seasonal and trend components remain fully independent during the early training phases. This approach enables the model to effectively capture inter-channel dependencies while minimizing overfitting. Experiments show that our model outperforms the state-of-the-art transformer-based models on several datasets.

660Safety Alignment Should be Made More Than Just a Few Tokens Deep

[openreview] [pdf]

Abstract The safety alignment of current Large Language Models (LLMs) is vulnerable. Simple attacks, or even benign fine-tuning, can jailbreak aligned models. We note that many of these vulnerabilities are related to a shared underlying issue: safety alignment can take shortcuts, wherein the alignment adapts a model’s generative distribution primarily over only its very first few output tokens. We refer to this issue collectively as shallow safety alignment. In this paper, we present case studies to explain why shallow safety alignment can exist and show how this issue universally contributes to multiple recently discovered vulnerabilities in LLMs, including the susceptibility to adversarial suffix attacks, prefilling attacks, decoding parameter attacks, and fine-tuning attacks. The key contribution of this work is that we demonstrate how this consolidated notion of shallow safety alignment sheds light on promising research directions for mitigating these vulnerabilities. We show that deepening the safety alignment beyond the first few tokens can meaningfully improve robustness against some common exploits. We also design a regularized fine-tuning objective that makes the safety alignment more persistent against fine-tuning attacks by constraining updates on initial tokens. Overall, we advocate that future safety alignment should be made more than just a few tokens deep.

661Low Variance: A Bottleneck in Diffusion-Based Graph Imputation

[openreview] [pdf]

Abstract In this paper, we tackle learning tasks on graphs with missing features, improving the applicability of graph neural networks to real-world graph-structured data. Existing imputation methods based upon graph diffusion produce channels that have nearly identical values within each channel, and these low-variance channels contribute very little to performance in graph learning tasks. To prevent diffusion-based imputation from producing low-variance channels, we introduce synthetic features that address the cause of the production, thereby increasing variance in low-variance channels. Since the synthetic features prevent diffusion-based imputation models from generating meaningless feature values shared across all nodes, our synthetic feature propagation design prevents significant performance degradation, even under extreme missing rates. Extensive experiments demonstrate the effectiveness of our scheme across various graph learning tasks with missing features, ranging from low to extremely high missing rates. Moreover, we provide empirical evidence and theoretical proof that validate the low-variance problem.
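For context, diffusion-based imputation of the kind analyzed here repeatedly propagates features over the normalized adjacency while clamping observed entries; the paper's synthetic features would be injected into low-variance channels before this loop to keep per-channel variance from collapsing toward a constant. A minimal sketch of the base imputation:

```python
import torch

def diffusion_impute(x, adj_norm, known_mask, n_iters=40):
    """Feature-propagation-style imputation: diffuse features over the
    normalized adjacency, resetting observed entries each iteration.

    x: (n, d) features with zeros at missing entries;
    adj_norm: (n, n) normalized adjacency; known_mask: (n, d) bool.
    """
    out = x.clone()
    for _ in range(n_iters):
        out = adj_norm @ out               # graph diffusion step
        out[known_mask] = x[known_mask]    # clamp observed values
    return out
```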

662A Primal-Dual Approach for Dynamic Pricing of Sequentially Displayed Complementary Items under Sale Constraints

[openreview] [pdf]

Abstract We address the challenging problem of dynamically pricing complementary items that are sequentially displayed to customers. An illustrative example is the online sale of flight tickets, where customers navigate through multiple web pages. Initially, they view the ticket cost, followed by ancillary expenses such as insurance and additional luggage fees. Coherent pricing policies for complementary items are essential because optimizing the pricing of each item individually is ineffective. Our scenario also involves a sales constraint, which specifies a minimum number of items to sell, and uncertainty regarding customer demand curves. To tackle this problem, we formulate it as a constrained Markov decision process. Leveraging online learning tools, we design a primal-dual online optimization algorithm. We empirically evaluate our approach using synthetic settings randomly generated from real-world data, covering various configurations from stationary to non-stationary, and compare its performance in terms of constraint violation and regret against well-known baselines that optimize each state individually.

663Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle

[openreview] [pdf]

Abstract Existing evaluation benchmarks of Large Language Models (LLMs) can become outdated due to continuous model updates and the evolving information landscape. This presents a significant challenge: How can we effectively evaluate LLMs in a way that remains relevant over time? To address this, we explore the potential of future event prediction as a continuous evaluation for LLMs, assessing their ability to make predictions about real-world events and exhibit temporal generalization. Towards this goal, we propose a continuous LLM evaluation using daily news. We automatically generate question-answer (QA) pairs from daily news, constructing our Daily Oracle dataset, which challenges LLMs to predict “future” events based on its pre-training data. Our findings show that as pre-training data becomes outdated, LLMs exhibit performance degradation over time. While the Retrieval Augmented Generation (RAG) technique can enhance prediction accuracy, the performance degradation pattern still exists, underscoring the necessity for ongoing model updates.

664Emerging Tracking from Video Diffusion

[openreview] [pdf]

Abstract We find that video diffusion models, renowned for their generative capabilities, surprisingly excel at pixel-level object tracking without any explicit training for this task. We introduce a simple and effective method to extract motion representations from video diffusion models, achieving state-of-the-art tracking results. Our approach enables the tracking of identical objects, overcoming limitations of previous methods reliant on intra-frame appearance correspondence. Visualizations and empirical results show that our approach outperforms recent supervised and self-supervised tracking methods, including the state-of-the-art, by up to 6 points. Our work demonstrates that video generative models can learn the intrinsic temporal dynamics of video and excel in tracking tasks beyond original video synthesis.

665Accelerating Diffusion Transformers with Token-wise Feature Caching

[openreview] [pdf]

Abstract Diffusion transformers have shown significant effectiveness in both image and video synthesis at the expense of huge computation costs. To address this problem, feature caching methods have been introduced to accelerate diffusion transformers by caching the features in previous timesteps and reusing them in the following timesteps. However, previous caching methods ignore that different tokens exhibit different sensitivities to feature caching, and feature caching on some tokens may degrade the overall generation quality up to 10X more than caching other tokens. In this paper, we introduce token-wise feature caching, allowing us to adaptively select the most suitable tokens for caching, and further enable us to apply different caching ratios to neural layers in different types and depths. Extensive experiments on PixArt-alpha, OpenSora, and DiT demonstrate our effectiveness in both image and video generation with no requirements for training. For instance, 2.36X and 1.93X acceleration are achieved on OpenSora and PixArt-alpha with almost no drop in generation quality. Code is included in the supplementary material and will be released on GitHub.
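A simplified sketch of token-wise selection: score each token by how much its feature moved between recently computed timesteps, recompute only the most sensitive fraction, and reuse cached features for the rest. The change-based score is an illustrative proxy, not necessarily the paper's criterion.

```python
import torch

def select_tokens_to_recompute(feat_prev, feat_curr, ratio=0.3):
    """Pick the tokens most sensitive to caching.

    feat_prev, feat_curr: (batch, tokens, dim) features from the two most
    recent fully computed timesteps. Returns indices of tokens to recompute;
    all other tokens reuse their cached features.
    """
    score = (feat_curr - feat_prev).norm(dim=-1)   # (batch, tokens)
    k = max(1, int(ratio * score.shape[1]))
    return score.topk(k, dim=1).indices
```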

666On the Convergence of No-Regret Dynamics in Information Retrieval Games with Proportional Ranking Functions

[openreview] [pdf]

Abstract Publishers who publish their content on the web act strategically, in a behavior that can be modeled within the online learning framework. Regret, a central concept in machine learning, serves as a canonical measure for assessing the performance of learning agents within this framework. We prove that any proportional content ranking function with a concave activation function induces games in which no-regret learning dynamics converge. Moreover, for proportional ranking functions, we prove the equivalence of the concavity of the activation function, the social concavity of the induced games, and the concavity of the induced games. We also study the empirical trade-offs between publishers’ and users’ welfare, under different choices of the activation function, using a state-of-the-art no-regret dynamics algorithm. Furthermore, we demonstrate how the choice of the ranking function and changes in the ecosystem structure affect these welfare measures, as well as the dynamics’ convergence rate.

667High Probability Bounds for Cross-Learning Contextual Bandits with Unknown Context Distributions

[openreview] [pdf]

Abstract Motivated by applications in online bidding and sleeping bandits, we examine the problem of contextual bandits with cross learning, where the learner observes the loss associated with the action across all possible contexts, not just the current round’s context. Our focus is on a setting where losses are chosen adversarially, and contexts are sampled i.i.d. from a specific distribution. This problem was first studied by Balseiro et al. (2019), who proposed an algorithm that achieves near-optimal regret under the assumption that the context distribution is known in advance. However, this assumption is often unrealistic. To address this issue, Schneider & Zimmert (2023) recently proposed a new algorithm that achieves nearly optimal expected regret. It is well-known that expected regret can be significantly weaker than high-probability bounds. In this paper, we present a novel, in-depth analysis of their algorithm and demonstrate that it actually achieves near-optimal regret with high probability. There are steps in the original analysis by Schneider & Zimmert (2023) that lead only to an expected bound by nature. In our analysis, we introduce several new insights. Specifically, we make extensive use of the weak dependency structure between different epochs, which was overlooked in previous analyses. Additionally, standard martingale inequalities are not directly applicable, so we refine martingale inequalities to complete our analysis.

668Bayesian Policy Distillation via Offline RL for Lightweight and Fast Inference

[openreview] [pdf]

Abstract High-performance deep reinforcement learning faces tremendous challenges when implemented on cost-effective low-end embedded systems due to its heavy computational burden. To address this issue, we propose a policy distillation method called Bayesian Policy Distillation (BPD), which effectively retrains small-sized neural networks through an offline reinforcement learning approach. BPD exploits Bayesian neural networks to distill already designed high-performance policy networks by adopting value-optimizing, behavior-cloning, and sparsity-inducing strategies. Simulation results reveal that the proposed BPD successfully compresses the policy networks, making them lighter and achieving faster inference time. Furthermore, the proposed approach is demonstrated on a real inverted pendulum system, reducing the inference time and memory size by 78% and 98%, respectively.

669Revisiting Source-Free Domain Adaptation: a New Perspective via Uncertainty Control

[openreview] [pdf]

Abstract Source-Free Domain Adaptation (SFDA) seeks to adapt a pre-trained source model to the target domain using only unlabeled target data, without access to the original source data. While current state-of-the-art (SOTA) methods rely on leveraging weak supervision from the source model to extract reliable information for self-supervised adaptation, they often overlook the uncertainty that arises during the transfer process. In this paper, we conduct a systematic and theoretical analysis of the uncertainty inherent in existing SFDA methods and demonstrate its impact on transfer performance through the lens of Distributionally Robust Optimization (DRO). Building upon the theoretical results, we propose a novel instance-dependent uncertainty control algorithm for SFDA. Our method is designed to quantify and exploit the uncertainty during the adaptation process, significantly improving the model performance. Extensive experiments on benchmark datasets and empirical analyses confirm the validity of our theoretical findings and the effectiveness of the proposed method. This work offers new insights into understanding and advancing SFDA performance.

670Positive Mining in Graph Contrastive Learning

[openreview] [pdf]

Abstract Graph Contrastive Learning (GCL), which aims to capture representations from unlabeled graphs, has made significant progress in recent years. In GCL, InfoNCE-based loss functions play a crucial role by ensuring that positive node pairs—those that are similar—are drawn closer together in the representational space, while negative pairs, which are dissimilar, are pushed apart. The primary focus of recent research has been on refining the contrastive loss function, particularly by adjusting the weighting of negative nodes. This is achieved by changing the weight between negative node pairs, or by using node similarity to select the positive node associated with the anchor node. Despite the substantial success of these GCL techniques, there remains a belief that the nodes identified as positive or negative may not accurately reflect the true positives and negatives. To tackle this challenge, we introduce an innovative method known as Positive Mining Graph Contrastive Learning (PMGCL). This method consists of calculating the probability of positive samples between the anchor node and other nodes using a mixture model, thereby identifying nodes that have a higher likelihood of being true positives in relation to the anchor node. We have conducted a comprehensive evaluation of PMGCL on a range of real-world graph datasets. The experimental findings indicate that PMGCL significantly outperforms traditional GCL methods. Our method not only achieves state-of-the-art results in unsupervised learning benchmarks but also exceeds the performance of supervised learning benchmarks in certain scenarios.

671DPaI: Differentiable Pruning at Initialization with Node-Path Balance Principle

[openreview] [pdf]

Abstract Pruning at Initialization (PaI) is a technique in neural network optimization characterized by the proactive elimination of weights before the network’s training on designated tasks. This innovative strategy potentially reduces the costs for training and inference, significantly advancing computational efficiency. A key element of PaI’s effectiveness is that it considers the significance of weights in an untrained network. It prioritizes the trainability and optimization potential of the pruned subnetworks. Recent methods can effectively prevent the formation of hard-to-optimize networks, e.g., through iterative adjustments at each network layer. However, this often results in large-scale discrete optimization problems, which can make PaI even more challenging. This paper introduces a novel method, called DPaI, that involves a differentiable optimization of the pruning mask. DPaI adopts a dynamic and adaptable pruning process, allowing easier optimization processes and better solutions. More importantly, our differentiable formulation enables ready use of the existing rich body of efficient gradient-based methods for PaI. Our empirical results demonstrate that DPaI significantly outperforms current state-of-the-art PaI methods on various architectures, such as Convolutional Neural Networks and Vision-Transformers.

672POTEC: Off-Policy Contextual Bandits for Large Action Spaces via Policy Decomposition

[openreview] [pdf]

Abstract We study off-policy learning (OPL) of contextual bandit policies in large discrete action spaces where existing methods -- most of which rely crucially on reward-regression models or importance-weighted policy gradients -- fail due to excessive bias or variance. To overcome these issues in OPL, we propose a novel two-stage algorithm, called Policy Optimization via Two-Stage Policy Decomposition (POTEC). It leverages clustering in the action space and learns two different policies via policy- and regression-based approaches, respectively. In particular, we derive a novel low-variance gradient estimator that enables learning a first-stage policy for cluster selection efficiently via a policy-based approach. To select a specific action within the cluster sampled by the first-stage policy, POTEC uses a second-stage policy derived from a regression-based approach within each cluster. We show that a local correctness condition, which only requires that the regression model preserves the relative expected reward differences of the actions within each cluster, ensures that our policy-gradient estimator is unbiased and the second-stage policy is optimal. We also show that POTEC provides a strict generalization of policy- and regression-based approaches and their associated assumptions. Comprehensive experiments demonstrate that POTEC provides substantial improvements in OPL effectiveness particularly in large and structured action spaces.

673Long-Term Fairness in Reinforcement Learning with Bisimulation Metrics

[openreview] [pdf]

Abstract Ensuring long-term fairness is crucial when developing automated decision making systems, specifically in dynamic and sequential environments. By maximizing their reward without consideration of fairness, AI agents can introduce disparities in their treatment of groups or individuals. In this paper, we establish the connection between bisimulation metrics and group fairness in reinforcement learning. We propose a novel approach that leverages bisimulation metrics to learn reward functions and observation dynamics, ensuring that learners treat groups fairly while reflecting the original problem. We demonstrate the effectiveness of our method in addressing disparities in sequential decision making problems through empirical evaluation on a standard fairness benchmark consisting of lending and college admission scenarios.

674Non-Adversarial Inverse Reinforcement Learning via Successor Feature Matching

[openreview] [pdf]

Abstract In inverse reinforcement learning (IRL), an agent seeks to replicate expert demonstrations through interactions with the environment. Traditionally, IRL is treated as an adversarial game, where an adversary searches over reward models, and a learner optimizes the reward through repeated RL procedures. This game-solving approach is both computationally expensive and difficult to stabilize. In this work, we propose a novel approach to IRL by direct policy optimization: exploiting a linear factorization of the return as the inner product of successor features and a reward vector, we design an IRL algorithm by policy gradient descent on the gap between the learner and expert features. Our non-adversarial method does not require learning a reward function and can be solved seamlessly with existing actor-critic RL algorithms. Remarkably, our approach works in state-only settings without expert action labels, a setting which behavior cloning (BC) cannot solve. Empirical results demonstrate that our method learns from as few as a single expert demonstration and achieves improved performance on various control tasks.
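The return factorization makes the objective concrete: estimate successor features for learner and expert and descend on their gap. A minimal sketch with Monte Carlo estimates follows; the actor-critic plumbing that takes the policy gradient of this gap is omitted.

```python
import torch

def successor_features(phi, gamma=0.99):
    """Monte Carlo successor-feature estimate for one trajectory:
    discounted sum of per-step state features phi of shape (T, d)."""
    disc = gamma ** torch.arange(phi.shape[0], dtype=phi.dtype)
    return (disc[:, None] * phi).sum(dim=0)

def sf_matching_loss(psi_learner, psi_expert):
    """Squared gap between learner and expert successor features.
    No reward function or adversary is needed: driving this gap to zero
    matches expert behavior (even without expert action labels)."""
    return (psi_learner - psi_expert).pow(2).sum()
```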

675You Can Train from Scratch: Further Discussion on the Long Range Arena

[openreview] [pdf]

Abstract Despite their success, Transformers suffer from quadratic complexity in the sequence length, limiting their applicability to long-range dependency problems and making them expensive to train and run. After many proposals to address this issue, the Long Range Arena (LRA) was suggested as a benchmark to evaluate the performance of new models in long-range dependency modeling tasks. The Transformer and its variants performed poorly on this benchmark, and a new series of architectures such as State Space Models (SSMs) gained some traction, greatly outperforming Transformers in the LRA. Recent work has shown that with a denoising pretraining phase, Transformers can achieve competitive results in the LRA with these new architectures. In this work, we show that one can achieve the same result without a separate pretraining phase, using other training techniques. This reduces the computational burden of training and eliminates the risk of representation collapse during fine-tuning. We argue that LRA tasks are very positional and provide evidence that short-range dependencies account for a significant portion of the performance. This explains prior differences in LRA accuracy between the Transformer and new architectures, which have better positional and local biases. Our training techniques alleviate these differences up to a point, and rotary embeddings add further improvements by including these positional biases. Given these insights, LRA results should be interpreted with caution, and should be analyzed given the model’s inductive biases and the nature of the tasks.

676ContextGNN: Beyond Two-Tower Recommendation Systems

[openreview] [pdf]

Abstract Recommendation systems predominantly utilize two-tower architectures, which evaluate user-item rankings through the inner product of their respective embeddings. However, one key limitation of two-tower models is that they learn a pair-agnostic representation of users and items. In contrast, pair-wise representations either scale poorly due to their quadratic complexity or are too restrictive on the candidate pairs to rank. To address these issues, we introduce Context-based Graph Neural Networks (ContextGNNs), a novel deep learning architecture for link prediction in recommendation systems. The method employs a pair-wise representation technique for familiar items situated within a user’s local subgraph, while leveraging two-tower representations to facilitate the recommendation of exploratory items. A final network then predicts how to fuse both pair-wise and two-tower recommendations into a single ranking of items. We demonstrate that ContextGNN is able to adapt to different data characteristics and outperforms existing methods, both traditional and GNN-based, on a diverse set of practical recommendation tasks, improving performance by 20% on average.

677Everyone Deserves Recourse: Feasible Recourse Paths Using Data Augmentation

[openreview] [pdf]

Abstract Decisions made using machine learning models can negatively impact individuals in critical applications such as healthcare and finance by denying essential services or access to opportunity. Algorithmic recourse supplements a negative AI decision by providing rejected individuals with advice on the changes they can make to their profiles, so that they may eventually achieve the desired outcome. Most existing recourse methods provide single-step changes by using counterfactual explanations. These counterfactual explanations are computed assuming a fixed (not learned) distance function. Further, few works consider providing more realistic multi-step changes in the form of recourse paths. However, such methods may fail to provide any recourse path for some individuals or provide paths that might not be feasible, since intermediate steps needed to reach the counterfactual explanation may not be realizable. We introduce a framework for learning an optimal distance function and threshold to compute multi-step recourse paths for all. First, we formalize the problem of finding multi-step recourse paths. Given a set of feasible transitions, we propose a data-driven framework for learning the optimal distance and threshold for each step with PAC (Probably Approximately Correct) guarantees. Finally, we provide a data augmentation algorithm to ensure that a solution exists for all individuals. Experiments on several datasets show that the proposed method learns feasible recourse paths for all individuals.

678Map to Optimal: Adapting Graph Out-of-Distribution in Test Time

[openreview] [pdf]

Abstract Based on topological proximity message passing, graph neural networks (GNNs) can quickly model data patterns on graphs. However, at test time, when the node feature and topological structure of the graph data are out-of-distribution (OOD), the performance of pre-trained GNNs will be hindered. Existing test-time methods either fine-tune the pre-trained model or overlook the discrepancy between the prior knowledge in pre-trained models and the test graph. We propose a novel self-supervised test-time adaptation paradigm GOAT (https://anonymous.4open.science/r/GOAT-5C0E), built on a graph augmentation-to-augmentation strategy, which enables a simple adapter to mitigate the distribution gap between training data and test-time data. GOAT reduces generalization error for node classification in various pre-trained settings through experiments on six benchmark datasets spanning three distinct real-world OOD scenarios. Remarkably, GOAT outperforms state-of-the-art test-time methods, and our empirical study further demonstrates the interpretability of the OOD representation generated from our method.

679Markovian Compression: Looking to the Past Helps Accelerate the Future

[openreview] [pdf]

Abstract This paper deals with distributed optimization problems that use compressed communication to achieve efficient performance and mitigate the communication bottleneck. We propose a family of compression schemes in which operators transform vectors fed to their input according to a Markov chain, i.e., the stochasticity of the compressors depends on previous iterations. Intuitively, this should accelerate the convergence of optimization methods, as considering previous iterations seems more natural and robust. The compressors are implemented in the vanilla Quantized Stochastic Gradient Descent (QSGD) algorithm. To further improve efficiency and convergence rate, we apply the momentum acceleration method. We prove convergence results for our algorithms with Markovian compressors and show theoretically that the accelerated method converges faster than the basic version. The analysis covers non-convex, Polyak-Lojasiewicz (PL), and strongly convex cases. Experiments are conducted to demonstrate the applicability of the results to distributed data-parallel optimization problems. Practical results demonstrate the superiority of methods utilizing our compressor design over several existing optimization algorithms.
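To make the idea concrete, the sketch below equips a QSGD-style stochastic quantizer with dither that mixes fresh noise with the previous iteration's noise, so the compressor's randomness forms a Markov chain. The specific correlation scheme is an illustrative assumption, not the paper's construction (and, unlike plain QSGD, it is not exactly unbiased).

```python
import numpy as np

class MarkovianQuantizer:
    """QSGD-style quantizer whose dither follows a Markov chain across
    iterations: u_t = rho * u_{t-1} + (1 - rho) * fresh uniform noise."""
    def __init__(self, levels=16, rho=0.9, seed=0):
        self.levels, self.rho = levels, rho
        self.rng = np.random.default_rng(seed)
        self.prev_u = None

    def compress(self, g):
        scale = np.linalg.norm(g) + 1e-12
        u = self.rng.random(g.shape)
        if self.prev_u is not None:
            u = self.rho * self.prev_u + (1 - self.rho) * u  # Markov dependence
        self.prev_u = u
        y = np.abs(g) / scale * self.levels
        q = np.floor(y + u)                                  # stochastic rounding
        return np.sign(g) * scale * q / self.levels
```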

680Improved Risk Bounds with Unbounded Losses for Transductive Learning

[openreview] [pdf]

Abstract In the transductive learning setting, we are provided with a labeled training set and an unlabeled test set, with the objective of predicting the labels of the test points. This framework differs from the standard problem of fitting an unknown distribution with a training set drawn independently from this distribution. In this paper, we primarily improve the generalization bounds in transductive learning. Specifically, we develop two novel concentration inequalities for the suprema of empirical processes sampled without replacement for unbounded functions, marking the first discussion of the generalization performance of unbounded functions in the context of sampling without replacement. We further provide two valuable applications of our new inequalities: on one hand, we first derive fast excess risk bounds for empirical risk minimization in transductive learning under unbounded losses. On the other hand, we establish high-probability bounds on the generalization error for graph neural networks when using stochastic gradient descent which improve the current state-of-the-art results.

681Trajectory-level Data Generation with Better Alignment for Offline Imitation Learning

[openreview] [pdf]

Abstract Offline reinforcement learning (RL) relies heavily on densely precise reward signals, which are labor-intensive and challenging to obtain in many real-world scenarios. To tackle this challenge, offline imitation learning (IL) extracts optimal policies from expert demonstrations and datasets without reward labels. However, the scarcity of expert data and the abundance of suboptimal trajectories within the dataset impede the application of supervised learning methods like behavior cloning (BC). While previous research has focused on learning importance weights for BC or reward functions to integrate with offline RL algorithms, these approaches often result in suboptimal policy performance due to training instabilities and inaccuracies in learned weights or rewards. To address this problem, we introduce Trajectory-level Data Generation with Better Alignment (TDGBA), an algorithm that leverages alignment measures between unlabeled trajectories and expert demonstrations to guide a diffusion model in generating highly aligned trajectories. With these trajectories, BC can be directly applied to extract optimal policies without the need for weight or reward learning. Moreover, to ensure high fidelity and diversity in the generated trajectories and to make the learning more stable, the implicit expert preference that can fully exploit the unlabeled data is employed in the training of the diffusion model. Experimental results on the D4RL benchmarks demonstrate that TDGBA significantly outperforms state-of-the-art offline IL methods. Additionally, the analysis of the generated trajectories shows the effectiveness of incorporating the diffusion model and implicit expert preference for trajectory-level data generation.

682GuideCO: Training Objective-Guided Diffusion Solver with Imperfect Data for Combinatorial Optimization

[openreview] [pdf]

Abstract Combinatorial optimization (CO) problems have widespread applications in science and engineering but they present significant computational challenges. Recent advancements in generative models, particularly diffusion models, have shown promise in bypassing traditional optimization solvers by directly generating near-optimal solutions. However, we observe an exponential scaling law between the optimality gap and the amount of training data needed for training diffusion-based solvers. Notably, the performance of existing diffusion solvers relies on both quantity and quality of training data: they perform well with abundant high quality training data labeled by exact or near-optimal solvers, while suffering when high-quality labels are scarce or unavailable. To address the challenge, we propose GuideCO, an objective-guided diffusion solver for combinatorial optimization, which can be trained on imperfectly labelled datasets. GuideCO is a two-stage generate-then-decode framework, featuring an objective-guided diffusion model that is further reinforced by classifier-free guidance for generating high-quality solutions on any given problem instance. Experiments demonstrate the improvements of GuideCO against baselines when trained on imperfect data, in a range of combinatorial optimization benchmark tasks such as TSP (Traveling Salesman Problem) and MIS (Maximum Independent Set).

683POMDiffuser: Long-Memory Meets Long-Planning for POMDPs

[openreview] [pdf]

Abstract Effective long-term planning in complex environments benefits from not only leveraging immediate information but also utilizing past experiences. Drawing inspiration from how humans use long-term memory in decision-making, we propose the POMDiffuser framework, an approach to planning in partially observable environments. While conventional Diffuser models often memorize specific environments, POMDiffuser explores the potential of learning to plan from memory, with the aim of generalizing to new scenarios. By incorporating a memory mechanism in POMDP scenarios, our model extends diffusion-based planning models into the realm of meta-learning with carefully designed tasks that require the diffusion planner to demonstrate both long-term planning and memory utilization. We investigate existing diffusion-based models, focusing on their applicability, computational efficiency, and performance trade-offs.

684Adversarial Attack Robust Dataset Pruning

[openreview] [pdf]

Abstract Dataset pruning, while effective for reducing training data size, often leads to models vulnerable to adversarial attacks. This paper introduces a novel approach to create adversarially robust coresets. We first theoretically analyze how existing pruning methods result in non-smooth loss surfaces, increasing susceptibility to attacks. To address this, we propose two key innovations: (1) a Frequency-Selective Excitation Network (FSE-Net) that dynamically selects important frequency components, smoothing the loss surface while reducing storage requirements, and (2) a “Joint-entropy” score for selecting stable and informative samples. Our method significantly outperforms state-of-the-art pruning algorithms across various adversarial attacks and pruning ratios. On CIFAR-10, our approach achieves up to 58.19% accuracy under AutoAttack with an 80% pruning ratio, compared to 42.98% for previous methods. Moreover, our frequency pruning technique improves robustness even on full datasets, demonstrating its potential for enhancing model security while reducing computational costs.

685Natural Policy Gradient for Average Reward Non-Stationary RL

[openreview] [pdf]

Abstract We consider the problem of non-stationary reinforcement learning (RL) in the infinite-horizon average-reward setting. We model it by a Markov Decision Process with time-varying rewards and transition probabilities, with a variation budget of $\Delta_T$. Existing non-stationary RL algorithms focus on model-based and model-free value-based methods. Policy-based methods, however, despite their flexibility in practice, are not theoretically well understood in non-stationary RL. We propose the first model-free policy-based algorithm, Non-Stationary Natural Actor-Critic (NS-NAC), a policy gradient method with a novel interpretation of learning rates as adapting factors. We present a dynamic regret of $\tilde{\mathcal{O}}(|\mathcal{S}|^{1/2}|\mathcal{A}|^{1/2}\Delta_T^{1/9}T^{8/9})$, where $T$ is the time horizon and $|\mathcal{S}|$, $|\mathcal{A}|$ are, respectively, the sizes of the state and action spaces. The regret analysis relies on adapting the Lyapunov-function-based analysis to dynamic environments and characterizing the effects of simultaneous changes in policy and the environment on estimates of the value function and average reward.

686Lasso Bandit with Compatibility Condition on Optimal Arm

[openreview] [pdf]

Abstract We consider a stochastic sparse linear bandit problem where only a sparse subset of context features affects the expected reward function, i.e., the unknown reward parameter has sparse structure. In the existing Lasso bandit literature, compatibility conditions, together with additional diversity conditions on the context features, are imposed to achieve regret bounds that only depend logarithmically on the ambient dimension $d$. In this paper, we demonstrate that even without the additional diversity assumptions, the compatibility condition on the optimal arm is sufficient to derive a regret bound that depends logarithmically on $d$, and our assumption is strictly weaker than those used in the Lasso bandit literature under the single-parameter setting. We propose an algorithm that adapts the forced-sampling technique and prove that the proposed algorithm achieves $\mathcal{O}(\mathrm{poly}\log dT)$ regret under the margin condition. To our knowledge, the proposed algorithm requires the weakest assumptions among Lasso bandit algorithms under the single-parameter setting that achieve $\mathcal{O}(\mathrm{poly}\log dT)$ regret. Through numerical experiments, we confirm the superior performance of our proposed algorithm.
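
A minimal sketch of the forced-sampling pattern the algorithm adapts: explore uniformly on a sparse schedule, otherwise act greedily on a Lasso estimate of the sparse reward parameter. The schedule and regularization choices below are placeholders, not the paper's:

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_bandit_action(t, contexts, X_hist, r_hist, forced_rounds, lam=0.1):
    # contexts: (num_arms, d) feature vectors for the current round.
    if t in forced_rounds:                        # forced exploration round
        return np.random.randint(len(contexts))
    theta_hat = Lasso(alpha=lam).fit(X_hist, r_hist).coef_  # sparse estimate
    return int(np.argmax(contexts @ theta_hat))   # greedy on estimated rewards
```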

687Beyond CVaR: Leveraging Static Spectral Risk Measures for Enhanced Decision-Making in Distributional Reinforcement Learning

[openreview] [pdf]

Abstract In domains such as finance, healthcare, and robotics, managing worst-case scenarios is critical, as failure to do so can lead to catastrophic outcomes. Distributional Reinforcement Learning (DRL) provides a natural framework to incorporate risk sensitivity into decision-making processes. However, existing approaches face two key limitations: (1) the use of fixed risk measures at each decision step often results in overly conservative policies, and (2) the interpretation and theoretical properties of the learned policies remain unclear. While optimizing a static risk measure addresses these issues, its use in the DRL framework has been limited to the simple static CVaR risk measure. In this paper, we present a novel DRL algorithm with convergence guarantees that optimizes for a broader class of static Spectral Risk Measures (SRM). Additionally, we provide a clear interpretation of the learned policy by leveraging the distribution of returns in DRL and the decomposition of static coherent risk measures. Extensive experiments demonstrate that our model learns policies aligned with the SRM objective, and outperforms existing risk-neutral and risk-sensitive DRL models in various settings.
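
Concretely, a static spectral risk measure is a weighted average of return quantiles, with non-increasing weights that emphasize the worst outcomes; CVaR is the special case of a flat weight on the lowest α-quantiles. A discretized sketch (the function name and discretization are ours, not the paper's):

```python
import numpy as np

def spectral_risk(returns, sigma):
    # sigma: weight function on quantile levels u in [0, 1]; non-negative and
    # non-increasing, emphasizing the worst (lowest) returns.
    sorted_r = np.sort(np.asarray(returns))       # ascending: worst first
    u = (np.arange(len(sorted_r)) + 0.5) / len(sorted_r)
    w = sigma(u)
    w = w / w.sum()                               # normalize discretized weights
    return float((w * sorted_r).sum())

# CVaR at level 0.1 as a spectral risk measure: average of the worst 10%.
cvar_10 = spectral_risk(np.random.randn(10_000), lambda u: (u <= 0.1) / 0.1)
```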

688TabWak: A Watermark for Tabular Diffusion Models

[openreview] [pdf]

Abstract Synthetic data offers alternatives for data augmentation and sharing. To date, it remains unknown how to use watermarking techniques to trace and audit synthetic tables generated by tabular diffusion models to mitigate potential misuses. In this paper, we design TabWak, the first watermarking method to embed invisible signatures that control the sampling of Gaussian latent codes used to synthesize table rows via the diffusion backbone. TabWak has two key features. Different from existing image watermarking techniques, TabWak uses self-cloning and shuffling to embed the secret key in positional information of random seeds that control the Gaussian latents, allowing different seeds to be used at each row for high inter-row diversity and enabling row-wise detectability. To further boost the robustness of watermark detection against post-editing attacks, TabWak uses a valid-bit mechanism that focuses on the tail of the latent code distribution for superior noise resilience. We provide theoretical guarantees on the row diversity and effectiveness of detectability. We evaluate TabWak on five datasets against baselines to show that the quality of watermarked tables remains nearly indistinguishable from non-watermarked tables while achieving high detectability in the presence of strong post-editing attacks, with a 100% true positive rate at a 0.1% false positive rate on synthetic tables with fewer than 300 rows. Our code is available at the following anonymized repository: https://anonymous.4open.science/r/TabWak-4E65/.

689Influence-based Attributions can be Manipulated

[openreview] [pdf]

Abstract Influence Functions are a standard tool for attributing predictions to training data in a principled manner and are widely used in applications such as data valuation and fairness. In this work, we present realistic incentives to manipulate influence-based attributions and investigate whether these attributions can be systematically tampered with by an adversary. We show that this is indeed possible for logistic regression models trained on ResNet feature embeddings and standard tabular fairness datasets, and we provide efficient attacks with backward-friendly implementations. Our work raises questions about the reliability of influence-based attributions in adversarial circumstances.
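
For context, the influence-function approximation being attacked is the standard one (Koh & Liang, 2017): up-weighting a training point changes the loss at a test point by roughly the quantity below, which is what an adversary would steer. The attack itself is not sketched here:

```python
import numpy as np

def influence(grad_test, grad_train, hessian):
    # Influence of training point z on a test prediction:
    #   I(z, z_test) = -grad_test^T H^{-1} grad_train,
    # where H is the Hessian of the training loss at the learned parameters.
    return -grad_test @ np.linalg.solve(hessian, grad_train)
```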

690Safe Meta-Reinforcement Learning via Dual-Method-Based Policy Adaptation: Near-Optimality and Anytime Safety Guarantee

[openreview] [pdf]

Abstract This paper studies the safe meta-reinforcement learning (safe meta-RL) problem where anytime safety is ensured during the meta-test. We develop a safe meta-RL framework that consists of two modules, safe policy adaptation and safe meta-policy training, and propose efficient algorithms for the two modules. Beyond existing safe meta-RL analyses, we prove the anytime safety guarantee of policy adaptation and provide a lower bound of the expected total reward of the adapted policies compared with the optimal policies, which shows that the adapted policies are nearly optimal. Our experiments demonstrate three key advantages over existing safe meta-RL methods: (i) superior optimality, (ii) anytime safety guarantee, and (iii) high computational efficiency.

691Eligibility Traces for Confounding Robust Off-Policy Evaluation: A Causal Approach

[openreview] [pdf]

Abstract A unifying theme in Artificial Intelligence is learning an effective policy to control an agent in an unknown environment in order to optimize a certain performance measure. Off-policy methods can significantly improve the sample efficiency during training since they allow an agent to learn from observed trajectories generated by different behavior policies, without directly deploying the target policies in the underlying environment. This paper studies off-policy evaluation from biased offline data where (1) unobserved confounding bias cannot be ruled out a priori; or (2) the observed trajectories do not overlap with intended behaviors of the learner, i.e., the target and behavior policies do not share a common support. Specifically, we first extend Bellman’s equation to derive effective closed-form bounds over value functions from the observational distribution contaminated with unobserved confounding and no-overlap. Second, we propose two novel algorithms that use eligibility traces to estimate these bounds from finite observational data. Compared to other partial identification methods for off-policy evaluation in sequential environments, these methods are model-free and do not rely on additional parametric knowledge about the system dynamics in the underlying environment.

692Numerical Pitfalls in Policy Gradient Updates

[openreview] [pdf]

Abstract Numerical instability, such as gradient explosion, is a fundamental problem in practical deep reinforcement learning (DRL) algorithms. Beyond anecdotal debugging heuristics, there is a lack of systematic understanding of the causes of the numerical sensitivity that leads to exploding gradient failures in practice. In this work, we demonstrate that the issue arises from the ill-conditioned density ratio in the surrogate objective that comes from importance sampling, which can take excessively large values during training. Perhaps surprisingly, while various policy optimization methods such as TRPO and PPO prevent excessively large policy updates, their optimization constraints on KL divergence and probability ratio cannot guarantee numerical stability. This also explains why gradient explosion often occurs during DRL training, even with code-level optimizations. We also discuss several potential approaches to ensure numerical stability and the challenges associated with them.
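
The ill-conditioned quantity in question is the importance-sampling ratio in the clipped surrogate. The sketch below shows why clipping the objective does not bound the ratio itself: `exp(logp_new - logp_old)` can still overflow when the policies diverge, which is the failure mode the paper analyzes:

```python
import torch

def ppo_clipped_surrogate(logp_new, logp_old, adv, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)   # unbounded density ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # The clip caps the *objective*, not the ratio itself or its gradients on
    # samples where the unclipped branch is active.
    return torch.min(ratio * adv, clipped * adv).mean()
```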

693Retrospective Learning from Interactions

[openreview] [pdf]

Abstract Multi-turn language interactions naturally include implicit feedback signals. For example, if a listener responds in an unexpected way to an instruction, the instructor may rephrase it, express frustration, or pivot to an alternative task. These signals are task-independent and occupy a relatively constrained subspace of language, allowing a language model to identify them even if it fails on the actual task. This holds the promise of continually learning and improving from interactions without additional annotations. We introduce ReSpect, a method to learn from signals in past interactions via retrospection. We deploy ReSpect in a new multimodal interaction scenario, where humans instruct an LLM to solve an abstract reasoning task with a combinatorial solution space. Through thousands of interactions with humans, we show how ReSpect gradually improves task completion rate from 31% to 82%, all without any external annotation.

694Policy Design in Long-run Welfare Dynamics

[openreview] [pdf]

Abstract We study a stochastic dynamic model of long-term welfare in a population. Individuals in our model have welfare that improves with intervention and deteriorates in the absence of treatment. The planner can treat one individual at each time step. We contrast two fundamental policies in our model. The utilitarian policy greedily maximizes welfare improvement at each step. The Rawlsian policy intervenes on the individual of lowest welfare. Although hugely influential as a normative proposal, Rawlsian policies have been criticized for failing to optimize social welfare. We prove that, surprisingly, in a meaningful range of parameters the Rawlsian policy has greater long-run utility than the utilitarian policy, even though it is inferior on short time horizons. Specifically, this is true provided that treatment effects satisfy a weak homogeneity assumption and the welfare dynamics satisfy a rich-get-richer and poor-get-poorer condition. We extend our results with a comprehensive comparison of different policies under different parameter regimes. Through semi-synthetic simulation studies, we evaluate various policies in cases where the assumptions of our theorems do not hold. Our results illustrate that comparing policies based on short-term evaluations can lead to misleading conclusions.

695Learning with Real-time Improving Predictions in Online MDPs

[openreview] [pdf]

Abstract In this paper, we introduce the Decoupling Optimistic Online Mirror Descent (DOOMD) algorithm, a novel online learning approach designed for episodic Markov Decision Processes with real-time improving predictions. Unlike conventional methods that employ a fixed policy throughout each episode, our approach allows for continuous updates of both predictions and policies within an episode. To achieve this, the DOOMD algorithm decomposes decision-making across states, enabling each state to execute an individual sub-algorithm that considers both immediate and long-term effects on future decisions. We theoretically establish a sub-linear regret bound for the algorithm, providing a guarantee on the worst-case performance.

696The Discretization Complexity Analysis of Consistency Models under Variance Exploding Forward Process

[openreview] [pdf]

Abstract Consistency models, a new class of one-step generative models, have shown state-of-the-art performance in one-step generation and achieve competitive performance compared to multi-step diffusion models. The most challenging part of consistency models is the training process, which discretizes the diffusion process and trains a consistency function to map any point at any discretized timepoint of the diffusion process to the data distribution. Despite the empirical success, only a few works focus on the discretization complexity of consistency models. However, the settings of those works are far from those of empirically successful consistency models, suffer from large discretization complexity, and fail to explain the empirical success of consistency models. To bridge the gap between theory and application, we analyze consistency models with two key properties: (1) a variance exploding forward process and (2) a gradually decaying discretization stepsize, both of which are widely used in empirical consistency models. Under this realistic setting, we take the first step toward explaining the empirical success of consistency models and achieve the state-of-the-art discretization complexity for consistency models, which is competitive with the results for diffusion models. After obtaining the results for the one-step sampling method of consistency models, we further analyze a multi-step consistency sampling algorithm proposed by \citet{song2023consistency} and show that this algorithm improves the discretization complexity compared with one-step generation, which matches the empirical observation.
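
The multi-step sampler analyzed here alternates denoising with re-noising, following the algorithm in \citet{song2023consistency}; a sketch, with `f(x, sigma)` denoting the learned consistency function:

```python
import torch

def multistep_consistency_sample(f, x_T, sigmas, sigma_min=0.002):
    # sigmas: decreasing noise levels, starting at the maximum sigma.
    x = f(x_T, sigmas[0])                     # one-step estimate from pure noise
    for sigma in sigmas[1:]:
        z = torch.randn_like(x)
        x_noisy = x + (sigma**2 - sigma_min**2) ** 0.5 * z   # re-noise
        x = f(x_noisy, sigma)                 # map back to the data manifold
    return x
```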

697Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback

[openreview] [pdf]

Abstract Learning from human feedback plays an important role in aligning generative models, such as large language models (LLMs). However, the effectiveness of this approach can be influenced by adversaries, who may intentionally provide misleading preferences to manipulate the output in an undesirable or harmful direction. To tackle this challenge, we study a specific model within this problem domain--contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary. We propose an algorithm, robust contextual dueling bandits (\algo), which is based on uncertainty-weighted maximum likelihood estimation. Our algorithm achieves an $\tilde{O}(d\sqrt{T}+dC)$ regret bound, where $T$ is the number of rounds, $d$ is the dimension of the context, and $0 \le C \le T$ is the total number of adversarial feedback. We also prove a lower bound to show that our regret bound is nearly optimal, both in scenarios with and without ($C=0$) adversarial feedback. To the best of our knowledge, our work is the first to achieve nearly minimax optimal regret for dueling bandits in the presence of adversarial preference feedback. Additionally, we conduct experiments to evaluate our proposed algorithm against various types of adversarial feedback. Experimental results demonstrate its superiority over the state-of-the-art dueling bandit algorithms in the presence of adversarial feedback.

698Bidirectional Consistency Models

[openreview] [pdf]

Abstract Diffusion models (DMs) are capable of generating remarkably high-quality samples by iteratively denoising a random vector, a process that corresponds to moving along the probability flow ordinary differential equation (PF ODE). Interestingly, DMs can also invert an input image to noise by moving backward along the PF ODE, a key operation for downstream tasks such as interpolation and image editing. However, the iterative nature of this process restricts its speed, hindering its broader application. Recently, Consistency Models (CMs) have emerged to address this challenge by approximating the integral of the PF ODE, largely reducing the number of iterations. Yet, the absence of an explicit ODE solver complicates the inversion process. To resolve this, we introduce the Bidirectional Consistency Model (BCM), which learns a single neural network that enables both forward and backward traversal along the PF ODE, efficiently unifying generation and inversion tasks within one framework. We can train BCM from scratch or tune it using a pre-trained consistency model, which reduces the training cost and increases scalability. We demonstrate that BCM enables one-step generation and inversion while also allowing the use of additional steps to enhance generation quality or reduce reconstruction error. We further showcase BCM’s capability in downstream tasks, such as interpolation, inpainting, and blind restoration of compressed images. Notably, when the number of function evaluations (NFE) is constrained, BCM surpasses domain-specific restoration methods, such as I2SB and Palette, in a fully zero-shot manner, offering an efficient alternative for inversion problems.

699Feedback Favors the Generalization of Neural ODEs

[openreview] [pdf]

Abstract The well-known generalization problem hinders the application of artificial neural networks in continuous-time prediction tasks with varying latent dynamics. In sharp contrast, biological systems can neatly adapt to evolving environments, benefiting from real-time feedback mechanisms. Inspired by this feedback philosophy, we present feedback neural networks, showing that a feedback loop can flexibly correct the learned latent dynamics of neural ordinary differential equations (neural ODEs), leading to a prominent generalization improvement. The feedback neural network is a novel two-DOF neural network, which possesses robust performance in unseen scenarios with no loss of accuracy on previous tasks. A linear feedback form with a convergence guarantee is first presented to correct the learned latent dynamics; domain randomization is then utilized to learn a nonlinear neural feedback form. Finally, extensive tests, including trajectory prediction of a real irregular object and model predictive control of a quadrotor with various uncertainties, are implemented, indicating significant improvements over state-of-the-art model-based and learning-based methods.
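
The linear feedback form admits a compact sketch: the learned vector field is corrected toward the observed state with a gain matrix, and a suitable gain contracts the prediction error, which is the intuition behind the stated convergence guarantee. Names below are ours; the paper's nonlinear neural feedback replaces the fixed gain:

```python
def feedback_ode_rhs(f_learned, K, x, t, y_obs):
    # dx/dt = f(x, t) + K (y(t) - x): the feedback term pulls the state toward
    # the observation y(t), correcting imperfect learned dynamics f.
    return f_learned(x, t) + K @ (y_obs(t) - x)
```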

700Beyond the Known: Decision Making with Counterfactual Reasoning Decision Transformer

[openreview] [pdf]

Abstract Decision Transformer (DT) plays a crucial role in modern reinforcement learning, leveraging offline datasets to achieve impressive results across various domains. However, DT requires high-quality, comprehensive data to perform optimally. In real-world applications, such ideal data is often lacking, with the underrepresentation of optimal behaviours posing a significant challenge. This limitation highlights the difficulty of relying on offline datasets for training, as suboptimal data can hinder performance. To address this, we propose the Counterfactual Reasoning Decision Transformer (CRDT), a novel framework inspired by counterfactual reasoning. CRDT enhances DT’s ability to reason beyond known data by generating and utilizing counterfactual experiences, enabling improved decision-making in out-of-distribution scenarios. Extensive experiments across continuous and discrete action spaces, including environments with limited data, demonstrate that CRDT consistently outperforms conventional DT approaches. Additionally, counterfactual reasoning equips the DT agent with stitching ability, allowing it to combine suboptimal trajectories. These results highlight the potential of counterfactual reasoning to enhance RL agents’ performance and generalization capabilities.

701Adversarial Policy Optimization for Preference-based Reinforcement Learning

[openreview] [pdf]

Abstract In this paper, we study offline preference-based reinforcement learning (PbRL), where learning is based on pre-collected preference feedback over pairs of trajectories. While offline PbRL has demonstrated remarkable empirical success, existing theoretical approaches face challenges in ensuring conservatism under uncertainty, requiring computationally intractable confidence set constructions. We address this limitation by proposing Adversarial Preference-based Policy Optimization (APPO), a computationally efficient algorithm for offline PbRL that guarantees sample complexity bounds without relying on explicit confidence sets. By framing PbRL as a two-player game between a policy and a model, our approach enforces conservatism in a tractable manner. Using standard assumptions on function approximation and bounded trajectory concentrability, we derive a sample complexity bound. To our knowledge, APPO is the first offline PbRL algorithm to offer both statistical efficiency and practical applicability. Experimental results on continuous control tasks demonstrate that APPO effectively learns from complex datasets, showing performance comparable to existing state-of-the-art methods.

702Qihoo-T2X: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Any-Task

[openreview] [pdf]

Abstract The global self-attention mechanism in diffusion transformers involves redundant computation due to the sparse and redundant nature of visual information, and the attention maps of tokens within a spatial window show significant similarity. To address this redundancy, we propose the Proxy-Tokenized Diffusion Transformer (PT-DiT), which employs sparse representative token attention (where the number of representative tokens is much smaller than the total number of tokens) to model global visual information efficiently. Specifically, within each transformer block, we compute an average token from each spatial-temporal window to serve as a proxy token for that region. The global semantics are captured through the self-attention of these proxy tokens and then injected into all latent tokens via cross-attention. Simultaneously, we introduce window and shift-window attention to address the limitations in detail modeling caused by the sparse attention mechanism. Building on the well-designed PT-DiT, we further develop the Qihoo-T2X family, which includes a variety of models for T2I, T2V, and T2MV tasks. Experimental results show that PT-DiT achieves competitive performance while reducing the computational complexity in both image and video generation tasks (e.g., a 49% reduction compared to DiT and a 34% reduction compared to PixArt-α). The visual exhibition of Qihoo-T2X is available at https://qihoo-t2x.github.io/.

703Learning Utilities from Demonstrations in Markov Decision Processes

[openreview] [pdf]

Abstract Our goal is to extract useful knowledge from demonstrations of behavior in sequential decision-making problems. Although it is well-known that humans commonly engage in risk-sensitive behaviors in the presence of stochasticity, most Inverse Reinforcement Learning (IRL) models assume a risk-neutral agent. Beyond introducing model misspecification, these models do not directly capture the risk attitude of the observed agent, which can be crucial in many applications. In this paper, we propose a novel model of behavior in Markov Decision Processes (MDPs) that explicitly represents the agent’s risk attitude through a utility function. We then define the Utility Learning (UL) problem as the task of inferring the observed agent’s risk attitude, encoded via a utility function, from demonstrations in MDPs, and we analyze the partial identifiability of the agent’s utility. Furthermore, we devise two provably efficient algorithms for UL in a finite-data regime, and we analyze their sample complexity. We conclude with proof-of-concept experiments that empirically validate both our model and our algorithms.

704FOSP: Fine-tuning Offline Safe Policy through World Models

[openreview] [pdf]

Abstract Offline Safe Reinforcement Learning (RL) seeks to address safety constraints by learning from static datasets and restricting exploration. However, these approaches heavily rely on the dataset and struggle to generalize to unseen scenarios safely. In this paper, we aim to improve safety during the deployment of vision-based robotic tasks through online fine-tuning of an offline pretrained policy. To facilitate effective fine-tuning, we introduce model-based RL, which is known for its data efficiency. Specifically, our method employs in-sample optimization to improve offline training efficiency while incorporating reachability guidance to ensure safety. After obtaining an offline safe policy, a safe policy expansion approach is leveraged for online fine-tuning. The performance of our method is validated on simulation benchmarks with five vision-only tasks and through real-world robot deployment using limited data. It demonstrates that our approach significantly improves the generalization of offline policies to unseen safety-constrained scenarios. To the best of our knowledge, this is the first work to explore offline-to-online RL for safe generalization tasks. The videos are available at https://sites.google.com/view/safefinetune/home.

705Deconstructing Denoising Diffusion Models for Self-Supervised Learning

[openreview] [pdf]

Abstract In this study, we examine the representation learning abilities of Denoising Diffusion Models (DDM) that were originally purposed for image generation. Our philosophy is to deconstruct a DDM, gradually transforming it into a classical Denoising Autoencoder (DAE). This deconstructive process allows us to explore how various components of modern DDMs influence self-supervised representation learning. We observe that only a very few modern components are critical for learning good representations, while many others are nonessential. Our study ultimately arrives at an approach that is highly simplified and to a large extent resembles a classical DAE. We hope our study will rekindle interest in a family of classical methods within the realm of modern self-supervised learning.

706Efficient Diffusion Models for Symmetric Manifolds

[openreview] [pdf]

Abstract We present a framework for designing efficient diffusion models on symmetric Riemannian manifolds, which include the torus, sphere, special orthogonal group, and unitary group. While diffusion models on symmetric manifolds have gained significant attention, existing approaches often rely on the manifolds’ heat kernels, which lack closed-form expressions and result in exponential-in-dimension per-iteration runtimes during training. We introduce a new diffusion model for symmetric-space manifolds, leveraging a projection of Euclidean Brownian motion to bypass explicit heat kernel computations. Our training algorithm minimizes a novel objective function derived via Ito’s Lemma, with efficiently computable gradients, allowing each iteration to run in polynomial time for symmetric manifolds. Additionally, the symmetries of the manifold ensure the diffusion satisfies an “average-case” Lipschitz condition, enabling accurate and efficient sample generation. These improvements enhance both the training runtime and sample accuracy for key cases of symmetric manifolds, helping to bridge the gap between diffusion models on symmetric manifolds and Euclidean space.

707Pretraining Decision Transformers with Reward Prediction for In-Context Structured Bandit Learning

[openreview] [pdf]

Abstract In this paper, we study the multi-task structured bandit problem where the goal is to learn a near-optimal algorithm that minimizes cumulative regret. The tasks share a common structure, and the algorithm exploits the shared structure to minimize the cumulative regret for an unseen but related test task. We use a transformer as a decision-making algorithm to learn this shared structure so as to generalize to the test task. Prior work on pretrained decision transformers, such as \dpt, requires access to the optimal action during training, which may be hard to obtain in several scenarios. Diverging from these works, our learning algorithm does not need the knowledge of the optimal action per task during training but predicts a reward vector for each of the actions using only the observed offline data from the diverse training tasks. Finally, during inference time, it selects actions using the reward predictions, employing various exploration strategies in-context for an unseen test task. We show that our model outperforms other SOTA methods such as \dpt\ and Algorithmic Distillation (\ad) over a series of experiments on several structured bandit problems (linear, bilinear, latent, non-linear). Interestingly, we show that our algorithm, without the knowledge of the underlying problem structure, can learn a near-optimal policy in-context by leveraging the shared structure across diverse tasks. We further extend the field of pre-trained decision transformers by showing that they can leverage unseen tasks with new actions and still learn the underlying latent structure to derive a near-optimal policy. We validate this over several experiments to show that our proposed solution is very general and has wide applications to potentially emergent online and offline strategies at test time. Finally, we theoretically analyze the performance of our algorithm and obtain generalization bounds in the in-context multi-task learning setting.

708Is Memorization Actually Necessary for Generalization?

[openreview] [pdf]

Abstract Memorization is the ability of deep models to associate training data with seemingly random labels. Even though memorization may not align with a model’s ability to generalize, recent work by \citet{feldman2020longtail} has demonstrated that memorization is in fact necessary for generalization. However, upon closer inspection, we find that their methodology has three limitations. First, the definition of memorization is imprecise, leading to contradictory results. Second, their proposed algorithm for approximating the leave-one-out test (the gold standard for calculating memorization scores) suffers from a high approximation error. Third, the authors induce a distribution shift when calculating marginal utility, leading to flawed results. Having accounted for these errors, we re-evaluate the role of memorization in generalization. We show that most memorization thresholds (the value that dictates whether a point is memorized or not) do not have a statistically significant impact on model accuracy, contrary to what was previously reported. In light of these findings, future researchers are encouraged to design techniques that can accurately approximate memorization scores.

709Enhancing Graph Invariant Learning from a Negative Inference Perspective

[openreview] [pdf]

Abstract The out-of-distribution (OOD) generalization challenge is a longstanding problem in graph learning. Through studying the fundamental cause of data distribution shift, i.e., the changes of environments, significant progress has been achieved in addressing this issue. However, we observe that existing works still fail to effectively address complex environment shifts. Previous practices place excessive attention on extracting causal subgraphs, inevitably treating spurious subgraphs as environment variables. While spurious subgraphs are controlled by environments, the space of environment changes encompasses more than the scale of spurious subgraphs. Therefore, existing efforts have a limited inference space for environments, leading to failure under severe environment changes. To tackle this issue, we propose a negative inference graph OOD framework (NeGo) to broaden the inference space for environment factors. Inspired by the successful practice of prompt learning in capturing underlying semantics and causal associations in large language models, we design a negative prompt environment inference to extract underlying environment information. We further introduce the environment-enhanced invariant subgraph learning method to effectively exploit the inferred environment embedding, ensuring the robust extraction of causal subgraphs under environment shifts. Lastly, we conduct a comprehensive evaluation of NeGo on real-world datasets and synthetic datasets across domains. NeGo outperforms baselines on nearly all datasets, verifying the effectiveness of our framework. Our source code is available at \url{https://anonymous.4open.science/r/NeGo-E4C1}.

710Fixing Data Augmentations for Out-of-distribution Detection

[openreview] [pdf]

Abstract Out-of-distribution (OOD) detection methods, especially post-hoc methods, rely on off-the-shelf pre-trained models. Existing literature shows how OOD and ID performance are correlated, i.e. stronger models with better ID performance tend to perform better in OOD detection. However, significant performance discrepancies exist between model versions, sometimes exceeding the impact of the OOD detection methods themselves. In this study, we systematically investigated this issue and identified two main factors—label smoothing and mixup—that, while improving in-distribution accuracy, lead to a decline in OOD detection performance. We provide empirical and theoretical explanations for this phenomenon and propose a solution that enhances OOD Detection while maintaining strong in-distribution performance. Code will be released upon acceptance.

711Single-Step Diffusion Model-Based Generative Model Inversion Attacks

[openreview] [pdf]

Abstract Generative model inversion attacks (MIAs) have garnered increasing attention for their ability to reconstruct synthetic samples that closely resemble private training data, exposing significant privacy risks in machine learning models. The success of generative MIAs is primarily attributed to image priors learned by generative adversarial networks (GANs) on public auxiliary data, which help constrain the optimization space during the inversion process. However, GAN-based generative MIAs still face limitations, particularly regarding the instability during model inversion optimization and the fidelity of reconstructed samples, indicating substantial room for improvement. In this paper, we address these challenges by exploring generative MIAs based on diffusion models, which offer superior generative performance compared to GANs. Specifically, we replace the GAN generator in existing generative MIAs with a single-step generator distilled from pretrained diffusion models, constraining the search space to the manifold of the generator during the inversion process. In addition, we leverage generative model inversion techniques to investigate privacy leakage issues in widely used large-scale multimodal models, particularly CLIP, highlighting the inherent privacy risks in these models. Our extensive experiments demonstrate that single-step diffusion model-based MIAs significantly outperform their GAN-based counterparts, achieving substantial improvements in traditional metrics and greatly enhancing the visual fidelity of reconstructed samples. This research uncovers vulnerabilities in CLIP models and opens new research directions in generative MIAs.

712Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree?

[openreview] [pdf]

Abstract Reward Models (RMs) are crucial for aligning language models with human preferences. Currently, the evaluation of RMs depends on measuring accuracy against a validation set of manually annotated preference data. Although this method is straightforward and widely adopted, the relationship between RM accuracy and downstream policy performance remains under-explored. In this work, we conduct experiments in a synthetic setting to investigate how differences in RM measured by accuracy translate into gaps in optimized policy performance. Our findings reveal that while there is a weak positive correlation between accuracy and downstream performance, policies optimized towards RMs with similar accuracy can exhibit quite different performance. Moreover, we discover that the way of measuring accuracy significantly impacts its ability to predict the final policy performance. Through the lens of Regressional Goodhart’s effect, we identify the existence of exogenous variables impacting the relationship between RM quality measured by accuracy and policy model capability. This underscores the inadequacy of relying solely on accuracy to reflect their impact on policy optimization.

713On the Relation Between Linear Diffusion and Power Iteration

[openreview] [pdf]

Abstract Recently, diffusion models have gained popularity due to their impressive generative abilities. These models learn the implicit distribution given by the training dataset, and sample new data by transforming random noise through the reverse process, which can be thought of as gradual denoising. In this work, we examine the generation process as a “correlation machine”, where random noise is repeatedly enhanced in correlation with the implicit given distribution. To this end, we explore the linear case, where the optimal denoiser is known to be the PCA projection. This enables us to connect the theory of diffusion models to the spiked covariance model, where the dependence of the denoiser on the noise level and the amount of training data can be expressed analytically, in the rank-1 case. In a series of numerical experiments, we extend this result to general low-rank data, and show that low frequencies emerge earlier in the generation process, where the denoising basis vectors are more aligned to the true data with a rate depending on their eigenvalues. This model allows us to show that the linear diffusion model converges in mean to the leading eigenvector of the underlying data, similarly to the prevalent Power Iteration method. Finally, we empirically demonstrate the applicability of our findings beyond the linear case, in the Jacobians of a deep, non-linear denoiser used in general image generation tasks.
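
For readers unfamiliar with the reference point, power iteration converges to the leading eigenvector by repeatedly applying the matrix, the same fixed point the linear diffusion model is shown to approach in mean:

```python
import numpy as np

def power_iteration(A, n_iter=100, seed=0):
    # Repeatedly apply A and renormalize; the iterate aligns with the
    # eigenvector of the largest-magnitude eigenvalue.
    v = np.random.default_rng(seed).standard_normal(A.shape[0])
    for _ in range(n_iter):
        v = A @ v
        v /= np.linalg.norm(v)
    return v
```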

714On Extending Direct Preference Optimization to Accommodate Ties

[openreview] [pdf]

Abstract We derive and investigate two DPO variants that explicitly model the possibility of declaring a tie in pair-wise comparisons. We replace the Bradley-Terry model in DPO with two well-known modeling extensions, by Rao and Kupper and by Davidson, that assign probability to ties as alternatives to clear preferences. Our experiments in neural machine translation and summarization show that explicitly labeled ties can be added to the datasets for these DPO variants without the degradation in task performance that is observed when the same tied pairs are presented to DPO. We find empirically that the inclusion of ties leads to stronger regularization with respect to the reference policy as measured by KL divergence, and we see this even for DPO in its original form. These findings motivate and enable the inclusion of tied pairs in preference optimization as opposed to simply discarding them.
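
For concreteness, the Rao-Kupper model (one of the two extensions used) assigns tie probability via a threshold parameter ν ≥ 1, with ν = 1 recovering Bradley-Terry; a small sketch of the three outcome probabilities from two reward scores:

```python
import numpy as np

def rao_kupper_probs(s_a, s_b, nu=1.5):
    # Strengths are exp(score); nu >= 1 widens the tie region (nu = 1 is BT).
    pa, pb = np.exp(s_a), np.exp(s_b)
    p_a_wins = pa / (pa + nu * pb)
    p_b_wins = pb / (pb + nu * pa)
    p_tie = 1.0 - p_a_wins - p_b_wins
    return p_a_wins, p_b_wins, p_tie
```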

715One to All: Individual Reweighting for User-Oriented Fairness in Recommender Systems

[openreview] [pdf]

Abstract Recommender systems often manifest biases toward a small user group, resulting in pronounced disparities in recommendation performance, i.e., the User-Oriented Fairness (UOF) issue. Existing research on UOF faces three major limitations, and no single approach effectively addresses all of them. Limitation 1: Post-processing methods fail to address the root cause of the UOF issue. Limitation 2: Some in-processing methods rely heavily on unstable user similarity calculations under severe data sparsity problems. Limitation 3: Other in-processing methods overlook the disparate treatment of individual users within user groups. In this paper, we propose a novel Individual Reweighting for User-Oriented Fairness framework, namely IR-UOF, to address all the aforementioned limitations. IR-UOF serves as a versatile solution applicable across various backbone recommendation models to achieve UOF. The motivation behind IR-UOF is to introduce an in-processing strategy that addresses the UOF issue at the individual level without the need to explore user similarities. We conduct extensive experiments on three real-world datasets using four backbone recommendation models to demonstrate the effectiveness of IR-UOF in mitigating UOF and improving recommendation fairness.

716A Causal Theoretical Framework for Open Set Domain Adaptation

[openreview] [pdf]

Abstract Open Set Domain Adaptation (OSDA) faces two critical challenges: the emergence of unknown classes in the target domain and changes in observed distributions across domains. Although numerous studies have proposed advanced algorithms, recent experimental results demonstrate that the classical Empirical Risk Minimization (ERM) approach still delivers state-of-the-art performance. However, few theories can effectively explain this disputed phenomenon. To address the theoretical gap, we focus on constructing a causal theoretical framework for OSDA. We formulate the novel concepts of the Fully Informative Causal Invariance Model (FICIM) and the Partially Informative Causal Invariance Model (PICIM). Subsequently, we derive an OSDA theoretical bound to prove that ERM performs well when the source domain follows FICIM, while it performs poorly when the source domain follows PICIM. The different results may be attributed to the varying amounts of available information when bounding the target domain’s stable expected risk. Finally, across different datasets, we conduct extensive experiments on the FICIM and PICIM source domains to validate the effectiveness of our theoretical results.

717Spatial-aware decision-making with ring attractors in Reinforcement Learning systems

[openreview] [pdf]

Abstract This paper explores the integration of ring attractors, a mathematical model inspired by neural circuit dynamics, into the reinforcement learning (RL) action selection process. Ring attractors, as specialized brain-inspired structures that encode spatial information and uncertainty, offer a biologically plausible mechanism to improve learning speed and predictive performance. They do so by explicitly encoding the action space, facilitating the organization of neural activity, and enabling the distribution of spatial representations across the neural network in the context of deep RL. The application of ring attractors in the RL action selection process involves mapping actions to specific locations on the ring and decoding the selected action based on neural activity. We investigate the application of ring attractors by both building them as exogenous models and integrating them as part of a Deep Learning policy algorithm. Our results show a significant improvement over state-of-the-art models on the Atari 100k benchmark. Notably, our integrated approach improves the performance of state-of-the-art models by roughly half, a 53% increase over selected baselines.
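
A hedged sketch of ring-attractor action selection as described: actions are mapped to equally spaced angles on a ring, each action's value injects a bump of activity at its angle, and the chosen action is decoded from the peak of the summed activity. The recurrent attractor dynamics are abstracted into a feed-forward readout here, and all names are illustrative:

```python
import numpy as np

def ring_attractor_select(q_values, kappa=3.0, resolution=360):
    q = np.asarray(q_values, dtype=float)
    angles = 2 * np.pi * np.arange(len(q)) / len(q)   # actions on the ring
    grid = np.linspace(0, 2 * np.pi, resolution, endpoint=False)
    bumps = np.exp(kappa * np.cos(grid[None, :] - angles[:, None]))
    activity = (q[:, None] * bumps).sum(axis=0)       # population activity
    peak = grid[np.argmax(activity)]                  # decoded direction
    circ_dist = np.abs(np.angle(np.exp(1j * (angles - peak))))
    return int(np.argmin(circ_dist))                  # nearest action
```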

718CFG++: Manifold-constrained Classifier Free Guidance for Diffusion Models

[openreview] [pdf]

Abstract Classifier-free guidance (CFG) is a fundamental tool in modern diffusion models for text-guided generation. Although effective, CFG has notable drawbacks. For instance, DDIM with CFG lacks invertibility, complicating image editing; furthermore, high guidance scales, essential for high-quality outputs, frequently result in issues like mode collapse. Contrary to the widespread belief that these are inherent limitations of diffusion models, this paper reveals that the problems actually stem from the off-manifold phenomenon associated with CFG, rather than from the diffusion models themselves. More specifically, inspired by the recent advancements of diffusion model-based inverse problem solvers (DIS), we reformulate text-guidance as an inverse problem with a text-conditioned score matching loss and develop CFG++, a novel approach that tackles the off-manifold challenges inherent in traditional CFG. CFG++ features a surprisingly simple fix to CFG, yet it offers significant improvements, including better sample quality for text-to-image generation, invertibility, smaller guidance scales, and more. Furthermore, CFG++ enables seamless interpolation between unconditional and conditional sampling at lower guidance scales, consistently outperforming traditional CFG at all scales. Moreover, CFG++ can be easily integrated into high-order diffusion solvers and naturally extends to distilled diffusion models. Experimental results confirm that our method significantly enhances performance in text-to-image generation, DDIM inversion, editing, and solving inverse problems, suggesting a wide-ranging impact and potential applications in various fields that utilize text guidance. Project Page: https://cfgpp-diffusion.github.io/anon

719Efficient Perplexity Bound and Ratio Matching in Discrete Diffusion Language Models

[openreview] [pdf]

Abstract While continuous diffusion models excel in modeling continuous distributions, their application to categorical data has been less effective. Recent work has shown that ratio-matching through score-entropy within a continuous-time discrete Markov chain (CTMC) framework serves as a competitive alternative to autoregressive models in language modeling. To enhance this framework, we first introduce three new theorems concerning the KL divergence between the data and learned distribution. Our results serve as the discrete counterpart to those established for continuous diffusion models and allow us to derive an improved upper bound on the perplexity. Second, we empirically show that ratio-matching performed by minimizing the denoising cross-entropy between the clean and corrupted data enables models to outperform those utilizing score-entropy with up to 10% lower perplexity/generative-perplexity, and 15% faster training steps. To further support our findings, we introduce and evaluate a novel CTMC transition-rate matrix that allows prediction refinement, and derive the analytic expression for its matrix exponential, which facilitates the computation of conditional ratios, thus enabling efficient training and generation.

720Elephant in the Room: Unveiling the Pitfalls of Human Proxies in Alignment

[openreview] [pdf]

Abstract The demand for regulating the behavior of large language models (LLMs) has ignited research on alignment algorithms, the essence of which is to align LLMs’ generations with human preferences. Due to the infeasibility of humans directly participating in the training or generation of LLMs, existing alignment algorithms choose to align with human preferences carried by proxies, i.e., preference data or reward models. However, whether these human proxies faithfully represent human preferences remains under-explored. We categorize human proxies into two levels based on the degree to which they directly embody human preferences: Level-1 Proxy (preference data) and Level-2 Proxy (reward models). We empirically examine the faithfulness of both levels of proxies and its impacts on alignment performance. We notice that current algorithms tend to overlook the faithfulness of these proxies in reflecting human preferences; many works even directly use reward models as their automatic evaluators without any correlation verification. Current literature on alignment overly focuses on optimizing algorithms, rendering the faithfulness of human proxies an “elephant in the room”—something extremely important yet largely overlooked. According to experimental results, we unveil potential risks of using inferior “human proxies”, aiming to draw attention to this huge “elephant” in alignment research. We summarize existing pitfalls from different angles and provide a re-labeled preference dataset and insights about reward model usage to facilitate the healthy development of alignment. (This work contains examples that potentially implicate stereotypes, associations, and other harms that could be offensive to individuals in certain social groups.)

721Looking into User’s Long-term Interests through the Lens of Conservative Evidential Learning

[openreview] [pdf]

Abstract Reinforcement learning (RL) has been increasingly employed in modern recommender systems to capture users’ evolving preferences, leading to continuously improved recommendations. In this paper, we propose a novel evidential conservative Q-learning framework (ECQL) that learns an effective and conservative recommendation policy by integrating evidence-based uncertainty and conservative learning. ECQL conducts evidence-aware explorations to discover items that are located beyond current observations but reflect users’ long-term interests. It offers an uncertainty-aware conservative view on policy evaluation to discourage deviating too much from users’ current interests. Two central components of ECQL include a uniquely designed sequential state encoder and a novel conservative evidential-actor-critic (CEAC) module. The former generates the current state of the environment by aggregating historical information and a sliding window that contains the current user interactions as well as newly recommended items from RL exploration that may represent future interests. The latter performs an evidence-based rating prediction by maximizing the conservative evidential Q-value and leverages an uncertainty-aware ranking score to explore the item space for a more diverse and valuable recommendation. Experiments on multiple real-world dynamic datasets demonstrate the state-of-the-art performance of ECQL and its capability to capture users’ long-term interests.

722Latent Trajectory: A New Framework for Actor-Critic Reinforcement Learning with Uncertainty Quantification

[openreview] [pdf]

Abstract Uncertainty quantification for deep neural networks is crucial for building reliable modern AI models. This challenge is particularly pronounced in deep reinforcement learning, where agents continuously learn from their interactions with stochastic environments, and the uncertainty of the value function is a key concern for ensuring reliable and robust RL applications. The complexity increases in actor-critic methods, as the training process alternates between optimizing the actor and critic networks, and this alternating optimization makes the uncertainty of the value function hard to quantify. To address this issue, we introduce a novel approach to RL training that conceptualizes transition trajectories as latent variables. Building on this framework, we propose an adaptive Stochastic Gradient Markov Chain Monte Carlo (SGMCMC) algorithm for training deep actor-critic models. This new training method allows for the implicit integration of latent transition trajectories, resulting in a trajectory-independent training process. We provide theoretical guarantees for the convergence of our algorithm and offer empirical evidence showing improvements in both performance and robustness of the deep actor-critic model under our Latent Trajectory Framework (LTF). Furthermore, this framework enables accurate uncertainty quantification for the value function of the RL system, paving the way for more reliable and robust RL applications.

723Diverse Preference Learning for Capabilities and Alignment

[openreview] [pdf]

Abstract As LLMs increasingly impact society, their ability to represent diverse perspectives is critical. However, recent studies reveal that alignment algorithms such as RLHF and DPO significantly reduce the diversity of LLM outputs. Not only do aligned LLMs generate text with repetitive structure and word choice, they also approach problems in more uniform ways, and their responses reflect a narrower range of societal perspectives. We attribute this problem to the KL divergence regularizer employed in preference learning algorithms. This causes the model to overweight majority opinions and sacrifice diversity in exchange for optimal reward. To address this, we propose Diverse Preference Learning, which decouples the entropy and cross-entropy terms in the KL penalty — allowing for fine-grained control over LLM generation diversity. From a capabilities perspective, LLMs trained using Diverse Preference Learning attain higher accuracy on difficult repeated sampling tasks and produce outputs with greater semantic and lexical diversity. From an alignment perspective, they are capable of representing a wider range of societal viewpoints and display improved logit calibration. Notably, Diverse Preference Learning resembles, but is a Pareto improvement over, standard temperature scaling.
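
The decoupling rests on the identity KL(π‖π_ref) = −H(π) + CE(π, π_ref), so weighting the two terms separately exposes an explicit diversity knob; equal weights recover the standard KL penalty. A sketch over one token distribution (coefficient names are ours, not the paper's notation):

```python
import torch
import torch.nn.functional as F

def decoupled_kl_penalty(logits, ref_logits, alpha=0.1, beta=0.1):
    logp = F.log_softmax(logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    p = logp.exp()
    entropy = -(p * logp).sum(-1)          # H(pi): raise alpha for diversity
    cross_ent = -(p * ref_logp).sum(-1)    # CE(pi, pi_ref): stay near reference
    return (-alpha * entropy + beta * cross_ent).mean()  # alpha == beta => beta * KL
```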

724Utilizing Explainable Reinforcement Learning to Improve Reinforcement Learning: A Theoretical and Systematic Framework

[openreview] [pdf]

Abstract Reinforcement learning (RL) faces two challenges: (1) The RL agent lacks explainability. (2) The trained RL agent is, in many cases, non-optimal and even far from optimal. To address the first challenge, explainable reinforcement learning (XRL) is proposed to explain the decision-making of the RL agent. In this paper, we demonstrate that XRL can also be used to address the second challenge, i.e., improve RL performance. Our method has two parts. The first part provides a two-level explanation for why the RL agent is not optimal by identifying the mistakes made by the RL agent. Since this explanation includes the mistakes of the RL agent, it has the potential to help correct the mistakes and thus improve RL performance. The second part formulates a constrained bi-level optimization problem to learn how to best utilize the two-level explanation to improve RL performance. Specifically, the upper level learns how to use the high-level explanation to shape the reward so that the corresponding policy can maximize the cumulative ground truth reward, and the lower level learns the corresponding policy by solving a constrained RL problem formulated using the low-level explanation. We propose a novel algorithm to solve this constrained bi-level optimization problem, and theoretically guarantee that the algorithm attains global optimality. We use MuJoCo experiments to show that our method outperforms state-of-the-art baselines.

725Thinking Forward and Backward: Effective Backward Planning with Large Language Models

[openreview] [pdf]

Abstract Large language models (LLMs) have exhibited remarkable reasoning and planning capabilities. Most prior work in this area has used LLMs to reason through steps from an initial to a goal state or criterion, thereby effectively reasoning in a forward direction. Nonetheless, many planning problems exhibit an inherent asymmetry such that planning backward from the goal is significantly easier --- for example, if there are bottlenecks close to the goal. We take inspiration from this observation and demonstrate that this bias holds for LLM planning as well: planning performance in one direction correlates with the planning complexity of the problem in that direction. However, our experiments also reveal systematic biases which lead to poor planning in the backward direction. With this knowledge, we propose a backward planning algorithm for LLMs that first flips the problem and then plans forward in the flipped problem. This helps avoid the backward bias, generate more diverse candidate plans, and exploit asymmetries between the forward and backward directions in planning problems --- we find that combining planning in both directions with self-verification improves the overall planning success rates by 4-24% in three planning domains.

726Task Characteristic and Contrastive Contexts for Improving Generalization in Offline Meta-Reinforcement Learning

[openreview] [pdf]

Abstract Context-based offline meta-reinforcement learning (meta-RL) methods typically extract contexts summarizing task information from historical trajectories to achieve adaptation to unseen target tasks. Nevertheless, previous methods may lack generalization and suffer from ineffective adaptation. Our key insight to counteract this issue is that they fail to capture both task characteristic and task contrastive information when generating contexts. In this work, we propose a framework called task characteristic and contrastive contexts for offline meta-RL (TCMRL), which consists of a task characteristic extractor and a task contrastive loss. More specifically, the task characteristic extractor aims at identifying transitions within a trajectory, that are characteristic of a task, when generating contexts. Meanwhile, the task contrastive loss favors the learning of task information that distinguishes tasks from one another by considering interrelations among transitions of trajectory subsequences. Contexts that include both task characteristic and task contrastive information provide a comprehensive understanding of the tasks themselves and implicit relationships among tasks. Experiments in meta-environments show the superiority of TCMRL over previous offline meta-RL methods in generating more generalizable contexts, and achieving efficient and effective adaptation to unseen target tasks.

727CADO: Cost-Aware Diffusion Models for Combinatorial Optimization via RL Fine-tuning

[openreview] [pdf]

Abstract Recent advancements in Machine Learning (ML) have demonstrated significant potential in addressing Combinatorial Optimization (CO) problems through data-driven approaches. Heatmap-based methods, which generate solution heatmaps in a single step and employ an additional decoder to derive solutions for CO tasks, have shown promise due to their scalability for large-scale problems. Traditionally, these complex models are trained using imitation learning with optimal solutions, often leveraging diffusion models. However, our research has identified several limitations inherent in these imitation learning approaches within the context of CO tasks. To overcome these challenges, we propose a 2-phase training framework for diffusion models in CO, incorporating Reinforcement Learning (RL) fine-tuning. Our methodology integrates cost information and the post-process decoder into the training process, thereby enhancing the solver’s capacity to generate effective solutions. We conducted extensive experiments on well-studied combinatorial optimization problems, specifically the Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS), ranging from small-scale instances to large-scale scenarios. The results demonstrate the significant efficacy of our RL fine-tuning framework, surpassing previous state-of-the-art methods in performance.

728DRoP: Distributionally Robust Pruning

[openreview] [pdf]

Abstract In the era of exceptionally data-hungry models, careful selection of the training data is essential to mitigate the extensive costs of deep learning. Data pruning offers a solution by removing redundant or uninformative samples from the dataset, which yields faster convergence and improved neural scaling laws. However, little is known about its impact on classification bias of the trained models. We conduct the first systematic study of this effect and reveal that existing data pruning algorithms can produce highly biased classifiers. We present theoretical analysis of the classification risk in a mixture of Gaussians to argue that choosing appropriate class pruning ratios, coupled with random pruning within classes has potential to improve worst-class performance. We thus propose DRoP, a distributionally robust approach to pruning and empirically demonstrate its performance on standard computer vision benchmarks. In sharp contrast to existing algorithms, our proposed method continues improving distributional robustness at a tolerable drop of average performance as we prune more from the datasets.
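
A minimal sketch of the pruning scheme the abstract describes: random pruning within each class, with per-class keep fractions supplied as input. Choosing those fractions robustly is the core of DRoP, which this sketch deliberately leaves abstract:

```python
import numpy as np

def prune_per_class(labels: np.ndarray, class_keep_fraction: dict, seed: int = 0) -> np.ndarray:
    """Return indices to keep: prune randomly *within* each class, but at a
    class-specific rate (e.g., keep more of the hardest classes). How the
    fractions are chosen is the method's contribution; here they are given."""
    rng = np.random.default_rng(seed)
    keep = []
    for c, frac in class_keep_fraction.items():
        idx = np.where(labels == c)[0]
        k = int(round(frac * len(idx)))
        keep.append(rng.choice(idx, size=k, replace=False))
    return np.sort(np.concatenate(keep))

# Usage with illustrative fractions: keep 90% of a hard class, 40% of an easy one.
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
print(prune_per_class(labels, {0: 0.9, 1: 0.4}))
```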

729Seeking Global Flat Minima in Federated Domain Generalization via Constrained Adversarial Augmentation

[openreview] [pdf]

Abstract Federated domain generalization (FedDG) aims at equipping the federally trained model with the domain generalization ability when the model meets new clients with domain shifts. Among factors that possibly indicate generalization, the loss landscape flatness of the trained model is an intuitive, viable, and widely studied one. However, pursuing the flatness of the global model in the FedDG setting is not trivial due to the restriction to preserve data privacy. To address this issue, we propose GFM, a novel algorithm designed to seek Global Flat Minima of the global model. Specifically, GFM leverages a global model-constrained adversarial data augmentation strategy, creating a surrogate for global data within each local client, which allows for split sharpness-aware minimization to approach global flat minima. GFM is compatible with federated learning without compromising data privacy restrictions, and theoretical analysis further supports its rationality by demonstrating that the objective of GFM serves as an upper bound on the robust risk of the global model on global data distribution. Extensive experiments on multiple FedDG benchmarks demonstrate that GFM consistently outperforms previous FedDG and federated learning approaches.

730The Utility and Complexity of In- and Out-of-Distribution Machine Unlearning

[openreview] [pdf]

Abstract Machine unlearning, the process of selectively removing data from trained models, is increasingly crucial for addressing privacy concerns and knowledge gaps post-deployment. Despite this importance, existing approaches are often heuristic and lack formal guarantees. In this paper, we analyze the fundamental utility, time, and space complexity trade-offs of approximate unlearning, providing rigorous certification analogous to differential privacy. For in-distribution data, we show that a surprisingly simple and general procedure—empirical risk minimization with output perturbation—achieves tight unlearning-utility-complexity trade-offs, addressing a previous theoretical gap on the separation from unlearning "for free" via differential privacy. However, such techniques fail out-of-distribution, where unlearning time complexity can exceed that of retraining, even for a single sample. To address this, we propose a new robust and noisy gradient descent variant that provably amortizes unlearning time complexity without compromising utility.
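
A minimal sketch of the in-distribution recipe named above, where `fit` is a hypothetical ERM routine and `sigma` is a noise scale that would be calibrated to the desired certification level (the calibration itself is the paper's analysis, not shown here):

```python
import numpy as np

def unlearn_by_output_perturbation(fit, dataset, forget_idx, sigma: float) -> np.ndarray:
    """Certified-unlearning sketch: run empirical risk minimization on the
    retained data and perturb the resulting parameters with Gaussian noise,
    so the output is statistically close to what retraining would produce.
    `fit` (hypothetical) maps a dataset to a parameter vector."""
    forget = set(forget_idx)
    retained = [z for i, z in enumerate(dataset) if i not in forget]
    theta = fit(retained)                                  # empirical risk minimizer
    return theta + sigma * np.random.randn(*theta.shape)  # output perturbation
```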

731DiverseAgentEntropy: Quantifying Black-Box LLM Uncertainty through Diverse Perspectives and Multi-Agent Interaction

[openreview] [pdf]

Abstract Quantifying the uncertainty in the factual parametric knowledge of Large Language Models (LLMs), especially in a black-box setting, poses a significant challenge. Existing methods, which gauge a model’s uncertainty through evaluating self-consistency in responses to the original query, do not always capture true uncertainty. Models might respond consistently to the original query with a wrong answer, yet respond correctly to varied questions from different perspectives about the same query, and vice versa. In this paper, we propose a novel method, DiverseAgentEntropy, for evaluating a model’s uncertainty using multi-agent interaction under the assumption that if a model is certain, it should consistently recall the answer to the original query across a diverse collection of questions about the same original query. We further implement an abstention policy to withhold responses when uncertainty is high. Our method offers a more accurate prediction of the model’s reliability by detecting hallucinations, improving upon self-consistency-based uncertainty methods by 2.5%. Additionally, it demonstrates that existing models often fail to consistently retrieve the correct answer to the same query under diverse variations of the question.

732Dynamic Multi-product Selection and Pricing under Preference Feedback

[openreview] [pdf]

Abstract In this study, we investigate the problem of dynamic multi-product selection and pricing by introducing a novel framework based on a censored multinomial logit (C-MNL) choice model. In this model, sellers present a set of products with prices, and buyers filter out products priced above their valuation, purchasing at most one product from the remaining options based on their preferences. The goal is to maximize seller revenue by dynamically adjusting product offerings and prices, while learning both product valuations and buyer preferences through purchase feedback. To achieve this, we propose a Lower Confidence Bound (LCB) pricing strategy. By combining this pricing strategy with either an Upper Confidence Bound (UCB) or Thompson Sampling (TS) product selection approach, our algorithms achieve regret bounds of \tilde{O}(d^{\frac{3}{2}}\sqrt{T}) and \tilde{O}(d^{2}\sqrt{T}), respectively. Finally, we validate the performance of our methods through simulations, demonstrating their effectiveness.

733Q-based Variational Inverse Reinforcement Learning

[openreview] [pdf]

Abstract The development of safe and beneficial AI requires that systems can learn and act in accordance with human preferences. However, explicitly specifying these preferences by hand is often infeasible. Inverse reinforcement learning (IRL) addresses this challenge by inferring preferences, represented as reward functions, from expert behavior. We introduce Q-based Variational IRL (QVIRL), a novel Bayesian IRL method that recovers a posterior distribution over rewards from expert demonstrations, primarily by learning a variational distribution over Q-values. Unlike previous approaches, QVIRL combines scalability with uncertainty quantification, important for safety-critical applications. We demonstrate QVIRL’s strong performance in apprenticeship learning across various tasks, including classical control problems and safe navigation in the Safety Gymnasium suite, where the method’s uncertainty quantification allows us to produce safer policies.

734Learning from Preferences and Mixed Demonstrations in General Settings

[openreview] [pdf]

Abstract Reinforcement learning is a general method for learning in sequential settings, but it can often be difficult to specify a good reward function when the task is complex. In these cases, preference feedback or expert demonstrations can be used instead. However, existing approaches utilising both together are either ad-hoc or rely on domain-specific properties. Building upon previous work, we develop a novel theoretical framework for learning from human data. Based on this we introduce LEOPARD: Learning Estimated Objectives from Preferences And Ranked Demonstrations. LEOPARD can simultaneously learn from a broad range of data, including negative/failed demonstrations, to effectively learn reward functions in general domains. We find that when a limited amount of human feedback is available, LEOPARD outperforms the current standard practice of pre-training on demonstrations and finetuning on preferences. Furthermore, we show that LEOPARD learns faster when given many types of feedback, rather than just a single one.

735Identifying and Addressing Delusions for Target-Directed Decision Making

[openreview] [pdf]

Abstract We are interested in target-directed agents, which produce targets during decision-time planning, to guide their behaviors and achieve better generalization during evaluation. Improper training of these agents can result in delusions: the agent may come to hold false beliefs about the targets, which cannot be properly rejected, leading to unwanted behaviors and damaging out-of-distribution generalization. We identify different types of delusions by using intuitive examples in carefully controlled environments, and investigate their causes. We demonstrate how delusions can be addressed for agents trained by hindsight relabeling, a mainstream approach for training target-directed RL agents. We empirically validate the effectiveness of the proposed solutions in correcting delusional behaviors and improving out-of-distribution generalization.

736Provable unlearning in topic modeling and downstream tasks

[openreview] [pdf]

Abstract Machine unlearning algorithms are increasingly important as legal concerns arise around the provenance of training data, but verifying the success of unlearning is often difficult. Provable guarantees for unlearning are often limited to supervised learning settings. In this paper, we provide the first theoretical guarantees for unlearning in the pre-training and fine-tuning paradigm by studying topic models, simple bag-of-words language models that can be adapted to solve downstream tasks like retrieval and classification. First, we design a provably effective unlearning algorithm for topic models that incurs a computational overhead independent of the size of the original dataset. Our analysis additionally quantifies the deletion capacity of the model -- i.e., the number of examples that can be unlearned without incurring a significant cost in model performance. Finally, we formally extend our analyses to account for adaptation to a given downstream task. In particular, we design an efficient algorithm to perform unlearning after fine-tuning the topic model via a linear head. Notably, we show that it is easier to unlearn pre-training data from models that have been fine-tuned to a particular task, and one can unlearn this data without modifying the base model.

737Emphasizing Discriminative Features for Dataset Distillation in Complex Scenarios

[openreview] [pdf]

Abstract Dataset distillation has demonstrated strong performance on simple datasets like CIFAR, MNIST, and TinyImageNet but struggles to achieve similar results in more complex scenarios. In this paper, we propose a novel approach that \textbf{e}mphasizes the \textbf{d}iscriminative \textbf{f}eatures (obtained by Grad-CAM) for dataset distillation, called \textbf{EDF}. Our approach is inspired by a key observation: in simple datasets, high-activation areas typically occupy most of the image, whereas in complex scenarios, the size of these areas is much smaller. Unlike previous methods that treat all pixels equally when synthesizing images, EDF uses Grad-CAM activation maps to enhance high-activation areas. From a supervision perspective, we downplay supervision signals that have lower losses, as they contain common patterns. Additionally, to help the DD community better explore complex scenarios, we build the Complex Dataset Distillation (Comp-DD) benchmark by meticulously selecting sixteen subsets, eight easy and eight hard, from ImageNet-1K. Notably, EDF consistently outperforms SOTA results in complex scenarios, such as ImageNet-1K subsets. Hopefully, more researchers will be inspired and encouraged to enhance the practicality and efficacy of DD. Our code and benchmark will be made public.

738InverseBench: Benchmarking Plug-and-Play Diffusion Models for Scientific Inverse Problems

[openreview] [pdf]

Abstract Plug-and-play diffusion models have emerged as a promising research direction for solving inverse problems. However, current studies primarily focus on natural image restoration, leaving the performance of these algorithms in scientific inverse problems largely unexplored. To address this gap, we introduce \textsc{InverseBench}, a unified framework that evaluates diffusion models across five distinct scientific inverse problems. These problems present unique structural challenges that differ from existing benchmarks, arising from critical scientific applications such as black hole imaging, seismology, optical tomography, medical imaging, and fluid dynamics. With \textsc{InverseBench}, we benchmark 15 inverse problem algorithms that use plug-and-play diffusion models against strong, domain-specific baselines, offering valuable new insights into the strengths and weaknesses of existing algorithms. We open-source the datasets, pre-trained models, and the codebase to facilitate future research and development.

739Novelty Unlocking with Multiobjective Generative Models: Batch Diversity of Human Motions

[openreview] [pdf]

Abstract Current generative models have shown promising performance in many tasks; they typically focus on generating samples that closely adhere to a given distribution, often overlooking the requirement to produce a diverse set of optimal solutions within a single batch. Recognizing that maintaining "diversity" has been a longstanding challenge in multiobjective optimization, we were inspired to introduce a multiobjective optimization approach to enhance diversity in a single pass. This paper utilizes the in-betweening human motion generation task as an example and introduces multiobjective generative models to demonstrate the effectiveness of the proposed method in producing diverse and smooth human motion sequences. The resulting method, termed the \textit{Multiobjective Generation Framework with In-Betweening Motion Model} (MGF-IMM), frames the human motion in-betweening task as a bi-objective optimization problem. The designed in-betweening motion model is then integrated into a nondominated sorting-based optimization framework to address this bi-objective optimization problem. Through comprehensive qualitative and quantitative experiments, MGF-IMM has demonstrated state-of-the-art performance, surpassing the latest methods and validating its superiority in generating diverse in-betweening human motions.

740A Versatile Influence Function for Data Attribution with Non-Decomposable Loss

[openreview] [pdf]

Abstract The influence function, a technique rooted in robust statistics, has been adapted in modern machine learning for a novel application: data attribution---quantifying how individual training data points affect a model’s predictions. However, the common derivation of influence functions in the data attribution literature is limited to loss functions that decompose into a sum of individual data point losses, with the most prominent examples known as M-estimators. This restricts the application of influence functions to more complex learning objectives, which we refer to as non-decomposable losses, such as contrastive or ranking losses, where a unit loss term depends on multiple data points and cannot be decomposed further. In this work, we bridge this gap by revisiting the general formulation of influence function from robust statistics, which extends beyond M-estimators. Based on this formulation, we propose a novel method, the Versatile Influence Function (VIF), that can be straightforwardly applied to machine learning models trained with any non-decomposable loss. In comparison to the classical approach in statistics, the proposed VIF is designed to fully leverage the power of auto-differentiation, thereby eliminating the need for case-specific derivations of each loss function. We demonstrate the effectiveness of VIF across three examples: Cox regression for survival analysis, node embedding for network analysis, and listwise learning-to-rank for information retrieval. In all cases, the influence estimated by VIF closely resembles the results obtained by brute-force leave-one-out retraining, while being up to 1000 times faster to compute. We believe VIF represents a significant advancement in data attribution, enabling efficient influence-function-based attribution across a wide range of machine learning paradigms, with broad potential for practical use cases.

741Compressed Decentralized Learning with Error-Feedback under Data Heterogeneity

[openreview] [pdf]

Abstract Decentralized learning distributes the training process across multiple nodes, enabling collaborative model training without relying on a central server. Each node performs local training using its own data, with model updates exchanged directly between connected nodes within a given network topology. Various algorithms have been developed within this decentralized learning framework and have been proven to converge under specific assumptions. However, two key challenges remain: 1) ensuring robust performance with both a high degree of gradient compression and data heterogeneity, and 2) providing a general convergence upper bound under commonly used assumptions. To address these challenges, we propose the Discounted Error-Feedback Decentralized Parallel Stochastic Gradient Descent (DEFD-PSGD) algorithm, which efficiently manages both high levels of gradient compression and data heterogeneity, without sacrificing communication efficiency. The core idea is to introduce controllable residual error feedback that effectively balances the impact of gradient compression and data heterogeneity. Additionally, we develop novel proof techniques to derive a convergence upper bound under relaxed assumptions. Finally, we present experimental results demonstrating that DEFD-PSGD outperforms other state-of-the-art decentralized learning algorithms, particularly in scenarios involving high compression and significant data heterogeneity.

742Empowering Teachers with Enhanced Knowledge via Variable Scale Distillation Framework

[openreview] [pdf]

Abstract Knowledge distillation, a widely used model compression technique, enables a smaller student network to replicate the performance of a larger teacher network by transferring knowledge, typically in the form of softened class probabilities or feature representations. However, current approaches often fail to maximize the teacher’s feature extraction capabilities, as they treat the semantic information transfer between teacher and student as equal. This paper addresses this limitation by enhancing the teacher’s learning process through a novel Variable Scale Distillation Framework. Central to our approach is the Rescale Block, which preserves scale consistency during hierarchical distillation, allowing the teacher to extract richer, more informative features. In extensive experiments on the CIFAR100 dataset, our method consistently outperforms state-of-the-art distillation techniques, achieving an average accuracy improvement of 2.12%. This demonstrates the effectiveness of our approach in fully leveraging the teacher’s capacity to guide the student, pushing the boundaries of knowledge distillation.

743PaI is getting competitive by training longer

[openreview] [pdf]

Abstract The success of iterative pruning methods in achieving state-of-the-art sparse networks has largely been attributed to improved mask identification and an implicit regularization induced by pruning. We challenge this hypothesis and instead posit that their increased training epochs enable improved optimization. To verify this, we show that pruning at initialization (PaI) is significantly boosted by increased training epochs with repeating (cyclic) learning rate schedules akin to iterative pruning, even outperforming standard iterative pruning methods. We conjecture that the dominant mechanism behind this improvement is a better exploration of the loss landscape, leading to a lower training loss. However, at high sparsity, increased training alone is not enough for competitive performance. A strong coupling between learnt parameter initialization and mask seems to be required. Standard methods obtain this coupling via expensive pruning-training iterations, starting from a dense network. To achieve this with sparse training instead, we propose SCULPT-ing, i.e., cyclic training of any sparse mask followed by a single pruning step to couple the parameters and the mask, which is able to match the performance of state-of-the-art iterative pruning methods in the high sparsity regime at reduced computational cost.

744LLMs Can Plan Only If We Tell Them

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated significant capabilities in natural language processing and reasoning, yet their effectiveness in autonomous planning has been under debate. While existing studies have utilized LLMs with external feedback mechanisms or in controlled environments for planning, these approaches often involve substantial computational and development resources due to the requirement for careful design and iterative backprompting. Moreover, even the most advanced LLMs like GPT-4 struggle to match human performance on standard planning benchmarks, such as the Blocksworld, without additional support. This paper investigates whether LLMs can independently generate long-horizon plans that rival human baselines. Our novel enhancements help achieve state-of-the-art results in planning benchmarks, outcompeting prior methods and human baselines, all autonomously.

745Model Extrapolation Expedites Alignment

[openreview] [pdf]

Abstract As the alignment training of large language models (LLMs) usually requires expensive computational resources, exploring more efficient alignment methods to reduce training overhead has always been an important and compelling research challenge. Inspired by prior work on model interpolation, we present a simple method called ExPO (model extrapolation) to expedite the alignment of LLMs with human preferences. Based on our observation that interpolating the weights between existing DPO/RLHF models and their initial SFT checkpoints usually produces new models with intermediate performance, we propose to treat a partially-trained model \mathcal{M}_1 (corresponding to the intermediate-performing model) as the interpolated result between the initial SFT checkpoint \mathcal{M}_0 and a hypothetical better-aligned model \mathcal{M}_2. Thus, we can obtain the hypothetical \mathcal{M}_2 by simply extrapolating the model weights along the direction from \mathcal{M}_0 to \mathcal{M}_1, which consequently saves the additional training overhead for \mathcal{M}_1 to reach better alignment performance. We validate our hypothesis through controlled experiments, demonstrating that ExPO can boost a DPO model trained with only 20% steps to outperform the fully-trained one. Additionally, we show that ExPO can also notably improve existing open-source LLMs (ranging from 1.8B to 70B parameters), as evidenced by evaluations on the mainstream LLM benchmarks AlpacaEval 2.0 and MT-Bench, which further highlights ExPO’s utility and potential in enabling more efficient LLM alignment.
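
The extrapolation step itself reduces to simple checkpoint arithmetic. A minimal sketch, where `alpha` (the extrapolation strength) and the plain state-dict format are assumptions:

```python
def expo_extrapolate(sft_state: dict, aligned_state: dict, alpha: float = 0.3) -> dict:
    """Extrapolate from the SFT checkpoint (M0) through the partially aligned
    model (M1) toward a hypothetical better-aligned model (M2):
        theta_2 = theta_1 + alpha * (theta_1 - theta_0).
    Values may be torch tensors or numpy arrays; keys must match."""
    return {
        name: aligned_state[name] + alpha * (aligned_state[name] - sft_state[name])
        for name in aligned_state
    }

# Usage (hypothetical checkpoints):
# m2_state = expo_extrapolate(sft_model.state_dict(), dpo_model.state_dict(), alpha=0.3)
# dpo_model.load_state_dict(m2_state)
```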

746Diffusion models for Gaussian distributions: Exact solutions and Wasserstein errors

[openreview] [pdf]

Abstract Diffusion or score-based models recently showed high performance in image generation. They rely on a forward and a backward stochastic differential equation (SDE). The sampling of a data distribution is achieved by solving numerically the backward SDE or its associated flow ODE. Studying the convergence of these models necessitates controlling four different types of error: the initialization error, the truncation error, the discretization error, and the score approximation error. In this paper, we study theoretically the behavior of diffusion models and their numerical implementation when the data distribution is Gaussian. In this restricted framework where the score function is a linear operator, we derive the analytical solutions of the backward SDE and the probability flow ODE. We prove that these solutions and their discretizations are all Gaussian processes, which allows us to compute exact Wasserstein errors induced by each error type for any sampling scheme. Monitoring convergence directly in the data space instead of relying on Inception features, our experiments show that the recommended numerical schemes from the diffusion models literature are also the best sampling schemes for Gaussian distributions.
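
For intuition, a standard computation under the common variance-preserving forward SDE (an assumed choice here; the abstract does not pin down the SDE) shows why every marginal stays Gaussian and the score stays linear:

```latex
% Variance-preserving (Ornstein-Uhlenbeck) forward SDE -- an assumed standard form:
%   dX_t = -(1/2) beta(t) X_t dt + sqrt(beta(t)) dW_t,   X_0 ~ N(mu, Sigma).
% With alpha_t = exp(-(1/2) \int_0^t beta(s) ds), the marginals stay Gaussian:
\[
X_t \sim \mathcal{N}\!\big(\alpha_t \mu,\; \alpha_t^2 \Sigma + (1-\alpha_t^2) I\big),
\]
% so the score is the linear (affine) map the abstract refers to:
\[
\nabla_x \log p_t(x) = -\big(\alpha_t^2 \Sigma + (1-\alpha_t^2) I\big)^{-1}\,(x - \alpha_t \mu).
\]
```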

747Channel-wise Influence: Estimating Data Influence for Multivariate Time Series

[openreview] [pdf]

Abstract The influence function, a robust statistics technique, is an effective post-hoc method that measures the impact of modifying or removing training data on model parameters, offering valuable insights into model interpretability without requiring costly retraining. It also enables extensions such as improving model performance, strengthening model generalization, and offering interpretability. Recently, Multivariate Time Series (MTS) analysis has become an important yet challenging task, attracting significant attention. However, there is no preceding research on influence functions for MTS to shed light on the effects of modifying the channels of an MTS. Given that each channel in an MTS plays a crucial role in its analysis, it is essential to characterize the influence of different channels. To fill this gap, we propose a channel-wise influence function, which is the first method that can estimate the influence of different channels in MTS, utilizing a first-order gradient approximation. Additionally, we demonstrate how this influence function can be used to estimate the influence of a channel in MTS. Finally, we validate the accuracy and effectiveness of our influence estimation function in critical MTS analysis tasks, such as MTS anomaly detection and MTS forecasting. In extensive experiments on real-world datasets, the original influence function performs worse than our method and even fails for the channel pruning problem, which demonstrates the superiority and necessity of the channel-wise influence function in MTS analysis.
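
A hedged sketch of a first-order, channel-wise influence score: the inner product between the gradient of the full loss and the gradient of a single channel's loss term. Isolating a channel through a per-channel loss is an assumption for illustration, not necessarily the paper's exact estimator:

```python
import torch

def channel_influence(model, loss_fn, x, y, channel: int) -> float:
    """First-order sketch for a multivariate series x, y of shape (T, C):
    score = <grad of full loss, grad of the loss restricted to one channel>.
    A large positive value suggests the channel pushes the parameters in
    the same direction as the overall fit."""
    params = [p for p in model.parameters() if p.requires_grad]
    pred = model(x.unsqueeze(0)).squeeze(0)  # (T, C), batch dim assumed
    g_full = torch.autograd.grad(loss_fn(pred, y), params, retain_graph=True)
    g_chan = torch.autograd.grad(loss_fn(pred[:, channel], y[:, channel]), params)
    flat = lambda gs: torch.cat([g.reshape(-1) for g in gs])
    return torch.dot(flat(g_full), flat(g_chan)).item()
```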

748Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape View

[openreview] [pdf]

Abstract Training language models currently requires pre-determining a fixed compute budget because the typical cosine learning rate schedule depends on the total number of steps. In contrast, the Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can in principle continue indefinitely without a pre-specified compute budget. Then, given any compute budget, one can branch out from the main branch at a proper time with a rapidly decaying learning rate to produce a strong model. Empirically, WSD generates an intriguing, non-traditional loss curve: the loss remains elevated during the stable phase but sharply declines during the decay phase. Towards explaining this phenomenon, we conjecture that pretraining loss exhibits a river valley landscape, which resembles a deep valley with a river at its bottom. Under this assumption, we show that during the stable phase, the iterate undergoes large oscillations due to the high learning rate, yet it progresses swiftly along the river. During the decay phase, the rapidly dropping learning rate minimizes the iterate’s oscillations, moving it closer to the river and revealing true optimization progress. Therefore, the sustained high learning rate phase and fast decaying phase are responsible for progress in the river and the mountain directions, respectively, and are both critical. Our analysis predicts phenomena consistent with empirical observations and shows that this landscape can naturally emerge from pretraining on a simple bi-gram dataset. Inspired by the theory, we introduce WSD-S, a variant of WSD that reuses previous checkpoints’ decay phases and keeps only one main branch, where we resume from a decayed checkpoint. WSD-S empirically outperforms WSD and Cyclic-Cosine in obtaining multiple pretrained language model checkpoints across various compute budgets in a single run, for parameter counts scaling from 0.1B to 1.2B.
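
A minimal WSD schedule sketch; the linear warmup/decay shapes and `min_lr` are assumptions, since the abstract only fixes the three-phase structure:

```python
def wsd_lr(step: int, peak_lr: float, warmup_steps: int,
           decay_start: int, decay_steps: int, min_lr: float = 0.0) -> float:
    """Warmup-Stable-Decay: linear warmup, a constant plateau that can run
    indefinitely (the 'main branch'), then a rapid decay once a compute
    budget is committed by choosing decay_start."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    if step < decay_start:
        return peak_lr  # stable phase: no budget committed yet
    frac = min(1.0, (step - decay_start) / max(1, decay_steps))
    return peak_lr + frac * (min_lr - peak_lr)

# Usage: branch off at step 90k with a 10k-step decay.
print(wsd_lr(50_000, 3e-4, 2_000, 90_000, 10_000))   # stable: 3e-4
print(wsd_lr(95_000, 3e-4, 2_000, 90_000, 10_000))   # mid-decay: 1.5e-4
```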

749Addressing Label Shift in Distributed Learning via Entropy Regularization

[openreview] [pdf]

Abstract We address the challenge of minimizing true risk in multi-node distributed learning. These systems are frequently exposed to both inter-node and intra-node label shifts, which present a critical obstacle to effectively optimizing model performance while ensuring that data remains confined to each node. To tackle this, we propose the Versatile Robust Label Shift (VRLS) method, which enhances the maximum likelihood estimation of the test-to-train label density ratio. VRLS incorporates Shannon entropy-based regularization and adjusts the density ratio during training to better handle label shifts at the test time. In multi-node learning environments, VRLS further extends its capabilities by learning and adapting density ratios across nodes, effectively mitigating label shifts and improving overall model performance. Experiments conducted on MNIST, Fashion MNIST, and CIFAR-10 demonstrate the effectiveness of VRLS, outperforming baselines by up to 20% in imbalanced settings. These results highlight the significant improvements VRLS offers in addressing label shifts. Our theoretical analysis further supports this by establishing high-probability bounds on estimation errors.

750Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training

[openreview] [pdf]

Abstract Large language models (LLMs), optimized through human feedback, have rapidly emerged as a leading paradigm for developing intelligent conversational assistants. However, despite their strong performance across many benchmarks, LLM-based agents might still lack conversational skills such as disambiguation -- when they are faced with ambiguity, they often overhedge or implicitly guess users’ true intents rather than asking clarification questions. Under task-specific settings, high-quality conversation samples are often limited, constituting a bottleneck for LLMs’ ability to learn optimal dialogue action policies. We propose Action-Based Contrastive Self-Training (ACT), a quasi-online preference optimization algorithm based on Direct Preference Optimization (DPO), that enables data-efficient dialogue policy learning in multi-turn conversation modeling. We demonstrate ACT’s efficacy in data-efficient tuning scenarios, even when there is no action label available, using multiple real-world conversational tasks: tabular-grounded question-answering, machine reading comprehension, and AmbigSQL, a novel task for disambiguating information-seeking requests for complex SQL generation towards data analysis agents. Additionally, we propose evaluating LLMs’ ability to function as conversational agents by examining whether they can implicitly recognize and reason about ambiguity in conversation. ACT demonstrates substantial conversation modeling improvements over standard tuning approaches like supervised fine-tuning and DPO.

751Theory on Mixture-of-Experts in Continual Learning

[openreview] [pdf]

Abstract Continual learning (CL) has garnered significant attention because of its ability to adapt to new tasks that arrive over time. Catastrophic forgetting (of old tasks) has been identified as a major issue in CL, as the model adapts to new tasks. The Mixture-of-Experts (MoE) model has recently been shown to effectively mitigate catastrophic forgetting in CL, by employing a gating network to sparsify and distribute diverse tasks among multiple experts. However, there is a lack of theoretical analysis of MoE and its impact on the learning performance in CL. This paper provides the first theoretical results to characterize the impact of MoE in CL via the lens of overparameterized linear regression tasks. We establish the benefit of MoE over a single expert by proving that the MoE model can diversify its experts to specialize in different tasks, while its router learns to select the right expert for each task and balance the loads across all experts. Our study further suggests an intriguing fact that the MoE in CL needs to terminate the update of the gating network after sufficient training rounds to attain system convergence, which is not needed in the existing MoE studies that do not consider the continual task arrival. Furthermore, we provide explicit expressions for the expected forgetting and overall generalization error to characterize the benefit of MoE in the learning performance in CL. Interestingly, adding more experts requires additional rounds before convergence, which may not enhance the learning performance. Finally, we conduct experiments on both synthetic and real datasets to extend these insights from linear models to deep neural networks (DNNs), which also shed light on the practical algorithm design for MoE in CL.

752Learning in complex action spaces without policy gradients

[openreview] [pdf]

Abstract Conventional wisdom suggests that policy gradient methods are better suited to complex action spaces than action-value methods. However, foundational studies have shown equivalences between these paradigms in small and finite action spaces (O’Donoghue et al., 2017; Schulman et al., 2017a). This raises the question of why their computational applicability and performance diverge as the complexity of the action space increases. We hypothesize that the apparent superiority of policy gradients in such settings stems not from intrinsic qualities of the paradigm, but from universal principles that can also be applied to action-value methods to serve similar functionality. We identify three such principles and provide a framework for incorporating them into action-value methods. To support our hypothesis, we instantiate this framework in what we term QMLE, for Q-learning with maximum likelihood estimation. Our results show that QMLE can be applied to complex action spaces with a controllable computational cost that is comparable to that of policy gradient methods, all without using policy gradients. Furthermore, QMLE demonstrates strong performance on the DeepMind Control Suite, even when compared to the state-of-the-art methods such as DMPO and D4PG.

753CycleVTON: Improving Diffusion-Based Virtual Try-On with Cycle-Consistent Training

[openreview] [pdf]

Abstract We present CycleVTON, a cycle-consistent diffusion-based virtual try-on framework. Unlike existing methods that rely on a single try-on network, our model consists of two conjugated networks. In addition to the regular try-on network, we design a clothing extraction network that extracts the clothing worn by the person and standardizes it into a front-facing format. These two networks are symmetrical, enabling alignment between the generated dressed human and real images of dressed humans, as well as between the extracted clothing and its front-facing ground truth. This cycle-consistent optimization strategy allows for enhanced retention of clothing textures and structures, ensuring more realistic and accurate clothing generation in virtual try-on scenarios. Moreover, the conjugated network structure not only supports traditional virtual try-on but also allows flexible clothing extraction and clothing exchange between different individuals. The experiments on VITON-HD demonstrate the effectiveness of our approach.

754DIAR: Diffusion-model-guided Implicit Q-learning with Adaptive Revaluation

[openreview] [pdf]

Abstract We propose a novel offline reinforcement learning (offline RL) approach, introducing the Diffusion-model-guided Implicit Q-learning with Adaptive Revaluation (DIAR) framework. We address two key challenges in offline RL: out-of-distribution samples and long-horizon problems. We leverage diffusion models to learn state-action sequence distributions and incorporate value functions for more balanced and adaptive decision-making. DIAR introduces an Adaptive Revaluation mechanism that dynamically adjusts decision lengths by comparing current and future state values, enabling flexible long-term decision-making. Furthermore, we address Q-value overestimation by combining Q-network learning with a value function guided by a diffusion model. The diffusion model generates diverse latent trajectories, enhancing policy robustness and generalization. As demonstrated in tasks like Maze2D, AntMaze, and Kitchen, DIAR consistently outperforms state-of-the-art algorithms in long-horizon, sparse-reward environments.

755Forgetting Order of Continual Learning: What is Learned First is Forgotten Last

[openreview] [pdf]

Abstract Catastrophic forgetting poses a significant challenge in continual learning, where models often forget previous tasks when trained on new data. Our empirical analysis reveals a strong correlation between catastrophic forgetting and the learning speed of examples: examples learned early are rarely forgotten, while those learned later are more susceptible to forgetting. We demonstrate that replay-based continual learning methods can leverage this phenomenon by focusing on mid-learned examples for rehearsal. We introduce Goldilocks, a novel replay buffer sampling method that filters out examples learned too quickly or too slowly, keeping those learned at an intermediate speed. Goldilocks improves existing continual learning algorithms, leading to state-of-the-art performance across several image classification tasks.
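
A minimal sketch of the buffer filter, assuming a per-example learning-speed statistic (here, the epoch at which an example was first consistently learned) has already been logged during training; the quantile cutoffs are illustrative:

```python
import numpy as np

def goldilocks_filter(first_learned_epoch: np.ndarray,
                      low_q: float = 0.25, high_q: float = 0.75) -> np.ndarray:
    """Return indices of examples learned at an intermediate speed:
    drop the fastest-learned (rarely forgotten anyway) and the
    slowest-learned (too fragile), keep the middle for the replay buffer."""
    lo, hi = np.quantile(first_learned_epoch, [low_q, high_q])
    return np.where((first_learned_epoch >= lo) & (first_learned_epoch <= hi))[0]

# Usage: examples first learned at these epochs; the middle band is kept.
speeds = np.array([1, 1, 3, 4, 5, 6, 9, 12])
print(goldilocks_filter(speeds))
```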

756SFW sampling for diffusion models via external conditioning

[openreview] [pdf]

Abstract Score-based generative models (SBMs), also known as diffusion models, are the de facto state of the art for image synthesis. Despite their unparalleled performance, SBMs have recently been in the spotlight for being tricked into creating not-safe-for-work (NSFW) content, such as violent images and non-consensual nudity. This article proposes a safe-for-work (SFW) sampler for SBMs implementing a Conditional Trajectory Correction step that guides the samples away from undesired regions in the ambient space using external multimodal models as the source of conditioning. Furthermore, using Contrastive Language Image Pre-training (CLIP), our method admits user-defined NSFW classes, which can vary in different settings. Our experiments on the text-to-image SBM Stable Diffusion validate that the proposed SFW sampler effectively reduces the generation of explicit content, as assessed via independent NSFW detectors. Moreover, the proposed correction comes at a minor cost in image quality and has an almost null effect on samples that do not need correction. Our study confirms the suitability of the SFW sampler towards aligned SBMs.

757Teaching Transformers Causal Reasoning through Axiomatic Training

[openreview] [pdf]

Abstract For text-based AI systems to interact in the real world, causal reasoning is an essential skill. Since active interventions are costly to execute, we study to what extent an agent can learn causal reasoning from symbolic demonstrations of causal axioms. Specifically, we consider an axiomatic training setup where an agent learns from multiple demonstrations of a causal axiom (or rule), rather than incorporating the axiom as an inductive bias or inferring it from data values. A key question is whether the agent would learn to generalize from the axiom demonstrations to new scenarios. For example, if a transformer model is trained on demonstrations of the causal transitivity axiom over small graphs, would it generalize to applying the transitivity axiom over large graphs? Our results, based on a novel axiomatic training scheme, indicate that such generalization is possible. For the transitivity axiom, we find that a 67 million parameter transformer model, when trained on linear causal chains (along with some noisy variations), can generalize well to new kinds of graphs, including longer causal chains, causal chains with reversed order, and graphs with branching, even when it is not explicitly trained for such settings. We extend axiomatic training to the harder task of inferring causation from correlation statements and find similar generalization. On both tasks, our model performs on par with (or even better than) many larger language models such as GPT-4, Gemini Pro, and Phi-3. Overall, our axiomatic training framework provides a new paradigm of learning causal reasoning in language models that can be extended to arbitrary axioms, as long as sufficient demonstrations can be generated.

758Lookahead Shielding for Regular Safety Properties in Reinforcement Learning

[openreview] [pdf]

Abstract To deploy reinforcement learning (RL) systems in real-world scenarios, we need to consider requirements such as safety and constraint compliance, rather than blindly maximizing for reward. In this paper, we develop a lookahead shielding framework for RL with regular safety properties, which, in contrast to prior shielding methodologies, requires minimal prior knowledge. At each environment step, our framework aims to satisfy the regular safety property for a bounded horizon with high probability; for the tabular setting, we provide provable guarantees. We compare our setup to some common algorithms developed for the constrained Markov decision process (CMDP), and we demonstrate the effectiveness and scalability of our framework by extensively evaluating it in both tabular and deep RL environments.

759Target-Oriented Soft-Robust Inverse Reinforcement Learning

[openreview] [pdf]

Abstract In imitation learning, when the learning agent is at a state that is outside the expert’s demonstration, it can be difficult for it to choose an action. To overcome this challenge, inverse reinforcement learning (IRL) learns a parameterized reward function based on which we can generalize the expert’s behavior to those states that are unseen in the demonstration. However, on the one hand, there could be multiple reward functions that can explain the expert’s behavior, leading to reward ambiguity in IRL. On the other hand, though we often consider the transition kernel of the expert to be known to the agent, sometimes the transition kernel of the agent is different from the expert’s and is unknown, leading to transition kernel ambiguity in IRL. Drawing on the notion of soft-robust optimization, we build a target-oriented soft-robust IRL (SRIRL) model where the performance of the output policy strikes a flexible balance between risk aversion and expected return maximization towards reward uncertainty in IRL. Moreover, by employing the robust satisficing framework, our SRIRL is also robust to transition kernel ambiguity in IRL. In our target-oriented SRIRL, we keep a target for the performance of the output policy that balances expected return and risk, and we minimize the constraint violation incurred by the difference between the ambiguous transition kernel and the empirical one. We derive a tractable reformulation for SRIRL, and we design tailored first-order methods for SRIRL. Numerical results showcase the soft robustness towards reward uncertainty and the robustness against transition kernel ambiguity of SRIRL, as well as the stronger scalability of our first-order methods compared to a state-of-the-art commercial solver.

760Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations

[openreview] [pdf]

Abstract A major barrier to the practical deployment of large language models (LLMs) is their lack of reliability. Three situations where this is particularly apparent are correctness, hallucinations when given unanswerable questions, and safety where responses are harmful or offensive. In all three cases, models should ideally abstain from responding---much like humans refrain from answering questions when uncertain. Inspired by analogous approaches in classification, this study explores the feasibility and efficacy of LLMs abstaining when uncertain in the domain of question-answering. We investigate two kinds of uncertainties: statistical uncertainty metrics and a distinct verbalized measure, termed In Dialogue Uncertainty (InDU), which measures hedge words such as 'I don't know' in responses. Using these uncertainty measures combined with models with and without reinforcement learning with human feedback (RLHF), we show that in all three situations, abstention based on the right kind of uncertainty measure can boost the reliability of LLMs. By abstaining for a few highly uncertain samples, we improve correctness by up to 8%, avoid 50% of hallucinations by correctly identifying unanswerable questions, and in particular increase safety by 70-99% with almost no additional computational overhead.
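
A minimal sketch combining the two uncertainty signals for abstention; the thresholds, hedge-word list, and answer-sampling protocol are all assumptions:

```python
import math
from collections import Counter

def should_abstain(sampled_answers: list[str],
                   entropy_threshold: float = 0.7,
                   hedge_words=("i don't know", "i'm not sure")) -> bool:
    """Abstain if either (1) responses verbalize uncertainty (an InDU-style
    signal) or (2) the spread of repeatedly sampled answers is high
    (a simple statistical uncertainty proxy via empirical entropy)."""
    if any(h in a.lower() for a in sampled_answers for h in hedge_words):
        return True  # verbalized uncertainty
    counts = Counter(sampled_answers)
    n = len(sampled_answers)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy > entropy_threshold

print(should_abstain(["Paris", "Paris", "Paris"]))          # False: consistent
print(should_abstain(["Paris", "Lyon", "I'm not sure..."]))  # True
```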

761ON EXTRAPOLATION IN MATERIAL PROPERTY REGRESSION

[openreview] [pdf]

Abstract Deep learning methods have yielded exceptional performance in material property regression (MPR). However, most existing methods operate under the assumption that the training and test data are independent and identically distributed (i.i.d.). This overlooks the importance of extrapolation---predicting material properties beyond the range of training data---which is essential for advanced material discovery, as researchers strive to identify materials with exceptional properties that exceed current capabilities. In this paper, we address this gap by introducing a comprehensive benchmark comprising seven tasks specifically designed to evaluate extrapolation in MPR. We critically evaluate existing methods, including deep imbalanced regression (DIR) and regression data augmentation (DA) methods, and reveal their limitations in extrapolation tasks. To address these issues, we propose the Matching-based EXtrapolation (MEX) framework, which reframes MPR as a material-property matching problem to alleviate the inherent complexity of the direct material-to-label mapping paradigm for better extrapolation. Our experimental results show that MEX outperforms all existing methods on our benchmark and demonstrates exceptional capability in identifying promising materials, underscoring its potential for advancing material discovery.

762Combating the Generalization-Forgetting Trade-off in Continual Learning: A Cautious Passive Low-Rank Approach

[openreview] [pdf]

Abstract Large Language Models (LLMs) have shown remarkable capabilities through wide-scale pre-training on a wide range of domains. However, they often suffer from catastrophic forgetting when learning sequential tasks. In this paper, we propose a novel parameter-efficient approach for continual learning in LLMs, which empirically explores the role of different effective layerwise ranks, leveraging lower ranks to mitigate catastrophic forgetting of previous tasks and higher ranks to enhance generalization on new tasks. By employing a subspace similarity metric that evaluates the orthogonality of low-rank subspaces between tasks, we gradually increase the rank of layerwise matrices for each new task, minimizing interference with previously learned tasks while enhancing generalization. Experimental results on standard continual learning benchmarks and challenging math benchmarks demonstrate that our method outperforms existing state-of-the-art approaches, effectively mitigating forgetting, improving task performance, and maintaining strong generalization to unseen tasks in a memory-efficient manner.

763Mitigating Generative Privacy Risks of Diffusion Models via Mixed Self-Synthesized Data Fine-tuning

[openreview] [pdf]

Abstract Diffusion models (DMs) have demonstrated exceptional performance across various generative tasks, yet they also face significant security and privacy concerns, such as Membership Inference Attacks (MIAs), where adversaries attempt to determine whether specific images were part of the DM’s training set. These threats present serious risks, particularly as pre-trained DMs are increasingly accessible online. To address these privacy concerns, we begin by investigating how fine-tuning DMs on a manipulated self-synthesized dataset affects their generative privacy risks, and have the following observations: (1) DMs fine-tuned solely on self-synthesized clean images are more vulnerable to privacy attacks; (2) DMs fine-tuned on perturbed self-synthesized images become more robust against privacy attacks but exhibit degraded image generation quality. Based on the observations, we propose MixSyn, a simple and effective framework designed to mitigate privacy risks by fine-tuning DMs on a mixed self-synthesized dataset, which is a mixture of clean and perturbed synthetic images. Extensive experimental results demonstrate that our method significantly mitigates the generative privacy risks of DMs while preserving their original image generation quality.

764Generalization Gradient Descent

[openreview] [pdf]

Abstract We propose a new framework for evaluating the relationship between features and generalization via a theoretical analysis of the out-of-distribution (OOD) generalization problem, in which we simultaneously use two mathematical tools: a generalization ratio that quantitatively characterizes the degree of generalization, and a generalization decision process (GDP) that formalizes the relationship between the losses on seen and unseen domains. By combining the concepts of informativeness and variation in the generalization ratio, we intuitively connect them to OOD problems and derive the generalization inequality. We then introduce it into the GDP to select the best loss from the seen domains for gradient descent during backpropagation. In the case where the classifier is defined by a fully connected neural network, the entire system is trained with backpropagation. There is no need for any model selection criterion or for operating on gradients during training. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of its generalization ability.

765Provable Post-Deployment Deterioration Monitoring

[openreview] [pdf]

Abstract Data distribution often changes when deploying a machine learning model into a new environment, but not all shifts degrade model performance, making interventions like retraining unnecessary. This paper addresses model post-deployment deterioration (PDD) monitoring in the context of unlabeled deployment distributions. We formalize unsupervised PDD monitoring within the model disagreement framework, where deterioration is detected if an auxiliary model, performing well on training data, shows significant prediction disagreement with the deployed model on test data. We propose D-PDDM, a principled monitoring algorithm achieving low false positive rates under non-deteriorating shifts, and provide sample complexity bounds for high true positive rates under deteriorating shifts. Empirical results on both a standard benchmark and a real-world large-scale healthcare dataset demonstrate the effectiveness of the framework, in addition to its viability as an alert mechanism for existing high-stakes ML pipelines.

766Weighted-Rank Contrastive Regression for Robust Learning on Imbalance Social Media Popularity Prediction

[openreview] [pdf]

Abstract Social Media Popularity Prediction (SMPP) is the task of forecasting the level of engagement a social media post will receive. It is crucial for understanding audience engagement and enabling targeted marketing strategies. However, the inherent imbalance in real-world social media data, where certain popularity levels are underrepresented, poses a significant challenge. In this study, we leveraged the recent success of contrastive learning and its growing integration into regression tasks by introducing a Weighted-Rank CR loss to address the data imbalance challenges. Experiments on the Social Media Prediction Dataset demonstrated that our method outperformed the vanilla approach and the current state-of-the-art contrastive regression approach, Rank-N-Contrast.

767Endless Jailbreaks with Bijection Learning

[openreview] [pdf]

Abstract Despite extensive safety training, LLMs are vulnerable to adversarial inputs. In this work, we introduce a simple but powerful attack paradigm, bijection learning, that yields a practically endless set of jailbreak prompts. We exploit language models’ advanced reasoning capabilities to teach them invertible languages (bijections) in context, pass encoded queries to the model to bypass built-in safety mechanisms, and finally decode responses back into English, yielding helpful replies to harmful requests. Our approach proves effective on a wide range of frontier language models and harm categories. Bijection learning is an automated and universal attack that grows stronger with scale: larger models with more advanced reasoning capabilities are more susceptible to bijection learning jailbreaks despite stronger safety mechanisms.
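
As a toy illustration of an invertible "language" (the paper teaches bijections in context and uses far richer mappings; this fixed letter permutation is only a stand-in):

```python
import random
import string

def make_bijection(seed: int = 0) -> tuple[dict, dict]:
    """A toy invertible encoding: a random permutation of the alphabet.
    Returns the encode map and its inverse."""
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    enc = dict(zip(letters, shuffled))
    dec = {v: k for k, v in enc.items()}
    return enc, dec

def apply_map(text: str, mapping: dict) -> str:
    """Apply the letter map; characters outside the map pass through."""
    return "".join(mapping.get(c, c) for c in text.lower())

enc, dec = make_bijection()
encoded_query = apply_map("describe the method", enc)          # sent to the model
assert apply_map(encoded_query, dec) == "describe the method"  # decode replies
```

The attack's key property is exactly this invertibility: the query travels through the model in encoded form, bypassing refusal triggers, and the response is mechanically decodable afterward.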

768Global Convergence of Policy Gradient in Average Reward MDPs

[openreview] [pdf]

Abstract We present the first comprehensive finite-time global convergence analysis of policy gradient for infinite horizon average reward Markov decision processes (MDPs). Specifically, we focus on ergodic tabular MDPs with finite state and action spaces. Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of O(\frac{1}{T}), which translates to O(\log(T)) regret, where T represents the number of iterations. Performance bounds for discounted reward MDPs cannot be easily extended to average reward MDPs, as the bounds grow proportionally to the fifth power of the effective horizon. Recent work on such extensions makes a smoothness assumption that has not been verified. Thus, our primary contribution is in providing the first complete proof that the policy gradient algorithm converges globally for average-reward MDPs, without such an assumption. We also obtain the corresponding finite-time performance guarantees. In contrast to the existing discounted reward performance bounds, our performance bounds have an explicit dependence on constants that capture the complexity of the underlying MDP. Motivated by this observation, we reexamine and improve the existing performance bounds for discounted reward MDPs. We also present simulations which empirically validate the result.
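
For reference, the average-reward objective and the plain policy gradient iteration that the analysis studies, written out in standard notation (the step size \eta and reward notation are the usual conventions, not taken from the abstract):

```latex
% Average-reward objective for an ergodic MDP, and the policy gradient iteration:
\[
J(\theta) = \lim_{T \to \infty} \frac{1}{T}\,
\mathbb{E}_{\pi_\theta}\!\Big[\textstyle\sum_{t=0}^{T-1} r(s_t, a_t)\Big],
\qquad
\theta_{k+1} = \theta_k + \eta\, \nabla_\theta J(\theta_k).
\]
% The stated guarantee: J(\theta^*) - J(\theta_T) = O(1/T),
% which corresponds to O(\log T) regret over T iterations.
```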

769Entropy-Based Aggregation for Fair and Effective Federated Learning

[openreview] [pdf]

Abstract Federated Learning (FL) enables collaborative model training across distributed devices while preserving data privacy. Nonetheless, the heterogeneity of edge devices often leads to inconsistent performance of the globally trained models, resulting in unfair outcomes among users. Existing federated fairness algorithms strive to enhance fairness but often fall short in maintaining the overall performance of the global model, typically measured by the average accuracy across all clients. To address this issue, we propose a novel algorithm that leverages entropy-based aggregation combined with model and gradient alignments to simultaneously optimize fairness and global model performance. Our method employs a bi-level optimization framework, where we derive an analytic solution to the aggregation probability in the inner loop, making the optimization process computationally efficient. Additionally, we introduce an innovative alignment update and an adaptive strategy in the outer loop to further balance global model’s performance and fairness. Theoretical analysis indicates that our approach guarantees convergence even in non-convex FL settings and demonstrates significant fairness improvements in generalized regression and strongly convex models. Empirically, our approach surpasses state-of-the-art federated fairness algorithms, ensuring consistent performance among clients while improving the overall performance of the global model.

770FedMAP: Unlocking Potential in Personalized Federated Learning through Bi-Level MAP Optimization

[openreview] [pdf]

Abstract Federated Learning (FL) enables collaborative training of machine learning (ML) models on decentralized data while preserving data privacy. However, data across clients often differs significantly due to class imbalance, feature distribution skew, sample size imbalance, and other phenomena. Learning from these non-identically distributed (non-IID) datasets poses challenges during training. Existing FL methods based on a single global model cannot effectively capture client data variations, resulting in suboptimal performance. Personalized FL (PFL) techniques were introduced to adapt to the local data distribution of each client and utilize the data from other clients. They have shown promising results in addressing these challenges. We propose FedMAP, a novel Bayesian PFL framework which applies Maximum A Posteriori (MAP) estimation to effectively mitigate various non-IID data issues, by means of a parametric prior distribution, which is updated during aggregation. We provide a theoretical foundation illustrating FedMAP’s convergence properties. In particular, we prove that the prior updates in FedMAP correspond to gradient descent iterations for a linear combination of envelope functions associated with the local losses. This differs from previous FL approaches, which aim at minimizing a weighted average of local loss functions and often face challenges with heterogeneous data distributions, resulting in reduced client performance and slower convergence in non-IID settings. Finally, we show, through evaluations of synthetic and real-world datasets, that FedMAP achieves better performance than existing methods. Moreover, we offer a robust, ready-to-use framework to facilitate practical deployment and further research.

771Direct Preference Optimization With Unobserved Preference Heterogeneity

[openreview] [pdf]

Abstract RLHF has emerged as a pivotal step in aligning language models with human objectives and values. It typically involves learning a reward model from human preference data and then using reinforcement learning to update the generative model accordingly. In contrast, Direct Preference Optimization (DPO) directly optimizes the generative model with preference data, skipping reinforcement learning. However, both RLHF and DPO assume uniform preferences, overlooking the reality of diverse human annotators. This paper presents a new method to align generative models with varied human preferences. We propose an Expectation-Maximization adaptation to DPO, generating a mixture of models based on latent preference types of the annotators. We then introduce a min-max regret ensemble learning model to produce a single generative model that minimizes worst-case regret among annotator subgroups with similar latent factors. Our algorithms leverage the simplicity of DPO while accommodating diverse preferences. Experimental results validate the effectiveness of our approach in producing equitable generative policies.

772STABLE DIFFUSION MODELS ARE SECRETLY GOOD AT VISUAL IN-CONTEXT LEARNING

[openreview] [pdf]

Abstract Large language models (LLM) in natural language processing (NLP) have demonstrated great potential for in-context learning (ICL) -- the ability to leverage a small set of example prompts to adapt to various tasks without having to explicitly update model weights. ICL has recently been explored for the visual domain with promising early outcomes. These approaches involve specialized training and/or additional data, which complicates the process and limits its generalizability. In this work, we show that off-the-shelf Stable Diffusion models can be re-purposed for visual in-context learning (V-ICL). Specifically, we formulate an in-place attention re-computation within the self-attention layers of the Stable Diffusion architecture that explicitly incorporates context between the query and example prompts. Without any additional fine-tuning, we show that this re-purposed Stable Diffusion model is able to adapt to six different tasks: foreground segmentation, single object detection, semantic segmentation, keypoint detection, edge detection, and colorization. For example, the proposed approach improves the mean intersection over union (mIoU) for the foreground segmentation task on the Pascal-5i dataset by 8.9% and 3.2% over recent methods such as Visual Prompting and IMProv, respectively. Additionally, we show that the proposed method is able to effectively leverage multiple prompts through ensembling to infer the task better and further improve the performance across all tasks.

773Jacobian Descent for Multi-Objective Optimization

[openreview] [pdf]

Abstract Many optimization problems require balancing multiple conflicting objectives. As gradient descent is limited to single-objective optimization, we introduce its direct generalization: Jacobian descent (JD). This algorithm iteratively updates parameters using the Jacobian matrix of a vector-valued objective function, in which each row is the gradient of an individual objective. While several methods to combine gradients already exist in the literature, they are generally hindered when the objectives conflict. In contrast, we propose projecting gradients to fully resolve conflict while ensuring that they preserve an influence proportional to their norm. We prove significantly stronger convergence guarantees with this approach, supported by our empirical results. Our method also enables instance-wise risk minimization (IWRM), a novel learning paradigm in which the loss of each training example is considered a separate objective. Applied to simple image classification tasks, IWRM exhibits promising results compared to the direct minimization of the average loss. Additionally, we outline an efficient implementation of JD using the Gramian of the Jacobian matrix to reduce time and memory requirements.
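
A hedged sketch of the kind of gradient combination Jacobian descent performs; the paper's exact projection operator is not reproduced here, so a PCGrad-style pairwise projection stands in to illustrate resolving conflicts among the Jacobian's rows.

```python
# Hedged sketch of one Jacobian descent step. The paper's exact aggregator
# is not reproduced; a PCGrad-style pairwise projection illustrates
# combining the rows of the Jacobian into a single update.
import numpy as np

def aggregate_jacobian(jac: np.ndarray) -> np.ndarray:
    """jac has shape (m, d): one gradient row per objective."""
    rows = jac.copy()
    for i in range(len(rows)):
        for j in range(len(rows)):
            if i == j:
                continue
            dot = rows[i] @ jac[j]
            if dot < 0:  # conflict: remove the component opposing objective j
                rows[i] -= dot / (jac[j] @ jac[j] + 1e-12) * jac[j]
    return rows.mean(axis=0)

# Two conflicting objectives in 2D:
jac = np.array([[1.0, 1.0],
                [-1.0, 0.5]])
update = aggregate_jacobian(jac)
# Stepping along -update does not increase any objective to first order:
assert (jac @ update >= -1e-9).all()
```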

774Towards Off-Road Autonomous Driving via Planner Guided Policy Optimization

[openreview] [pdf]

Abstract Off-road autonomous driving poses significant challenges such as navigating diverse terrains, avoiding obstacles, and maneuvering through ditches. Addressing these challenges requires effective planning and adaptability, making it a long-horizon planning and control problem. Traditional model-based control techniques like Model Predictive Path Integral (MPPI) require dense sampling and accurate modeling of the vehicle-terrain interaction, both of which are computationally expensive, making effective long-horizon planning in real-time intractable. Reinforcement learning (RL) methods operate without this limitation and are computationally cheaper at deployment. However, exploration in obstacle-dense and challenging terrains is difficult, and typical RL techniques struggle to navigate in these terrains. To alleviate the limitations of MPPI, we propose a hierarchical autonomy pipeline with a low-frequency high-level MPPI planner and a high-frequency low-level RL controller. To tackle RL’s exploration challenge, we propose a teacher-student paradigm to learn an end-to-end RL policy, capable of real-time execution and traversal through challenging terrains. The teacher policy is trained using dense planning information from an MPPI planner while the student policy learns to navigate using visual inputs and sparse planning information. In this framework, we introduce a new policy gradient formulation that extends Proximal Policy Optimization (PPO), leveraging off-policy trajectories for teacher guidance and on-policy trajectories for student exploration. We demonstrate our performance in a realistic off-road simulator against various RL and imitation learning methods.

775WMAdapter: Adding WaterMark Control to Latent Diffusion Models

[openreview] [pdf]

Abstract Watermarking is essential for protecting the copyright of AI-generated images. We propose WMAdapter, a diffusion model watermark plugin that embeds user-specified watermark information seamlessly during the diffusion generation process. Unlike previous methods that modify diffusion modules to incorporate watermarks, WMAdapter is designed to keep all diffusion components intact, resulting in sharp, artifact-free images. To achieve this, we introduce two key innovations: (1) We develop a contextual adapter that conditions on the content of the cover image to generate adaptive watermark embeddings. (2) We implement an additional finetuning step and a hybrid finetuning strategy that suppresses noticeable artifacts while preserving the integrity of the diffusion components. Empirical results show that WMAdapter provides strong flexibility, superior image quality, and competitive watermark robustness.

776Ctrl123: Consistent Novel View Synthesis via Closed-Loop Transcription

[openreview] [pdf]

Abstract Based on the success of large image diffusion models, multi-view diffusion models have demonstrated remarkable zero-shot capability in novel view synthesis (NVS). However, the pioneering work Zero123 struggles to maintain consistency across multiple generated views. While recent modifications in model and training design have improved multi-view consistency, they often introduce new limitations, such as restricted fixed view generation or reliance on additional conditions. These constraints hinder the broader application of multi-view diffusion models in downstream tasks like 3D reconstruction. We identify the root cause of inconsistency as the excessive diversity inherent in generative models utilized for the NVS task. To address this, we aim to utilize stronger supervision to better align generated views with ground-truth images and thereby constrain this diversity, and propose Ctrl123, a closed-loop transcription-based multi-view diffusion method that enforces alignment in the CLIP patch feature space. Extensive experiments demonstrate that Ctrl123 excels in arbitrary novel view generation, significantly improving multi-view consistency compared to existing methods.

777xTED: Cross-Domain Adaptation via Diffusion-Based Trajectory Editing

[openreview] [pdf]

Abstract Reusing pre-collected data from different domains is an appealing solution for decision-making tasks that have insufficient data in the target domain but are relatively abundant in other related domains. Existing cross-domain policy transfer methods mostly aim at learning domain correspondences or corrections to facilitate policy learning, such as learning domain/task-specific discriminators, representations, or policies. This design philosophy often results in heavy model architectures or task/domain-specific modeling, lacking flexibility. This reality makes us wonder: can we directly bridge the domain gaps universally at the data level, instead of relying on complex downstream cross-domain policy transfer models? In this study, we propose the Cross-Domain Trajectory EDiting (xTED) framework that employs a specially designed diffusion model for cross-domain trajectory adaptation. Our proposed model architecture effectively captures the intricate dependencies among states, actions, and rewards, as well as the dynamics patterns within target data. By utilizing the pre-trained diffusion as a prior, source domain trajectories can be transformed to match target domain properties while preserving original semantic information. This process implicitly corrects underlying domain gaps, enhancing state realism and dynamics reliability in the source data, and allowing flexible incorporation with various downstream policy learning methods. Despite its simplicity, xTED demonstrates superior performance in extensive simulation and real-robot experiments.

778Enhancing Logits Distillation with Plug&Play Kendall’s τ Ranking Loss

[openreview] [pdf]

Abstract Knowledge distillation typically employs the Kullback-Leibler (KL) divergence to constrain the output of the student model to precisely match the soft labels provided by the teacher model. However, the optimization process of KL divergence is challenging for the student and prone to suboptimal points. Also, we demonstrate that the gradients provided by KL divergence depend on channel scale and thus tend to overlook low-probability channels. The mismatch in low-probability channels also results in the neglect of inter-class relationship information, making it difficult for the student to further enhance performance. To address this issue, we propose an auxiliary ranking loss based on Kendall’s τ Coefficient, which can be plugged into any logit-based distillation method, providing inter-class relationship information and balancing the attention to low-probability channels. We show that the proposed ranking loss is less affected by channel scale, and its optimization objective is consistent with that of KL divergence. Extensive experiments on CIFAR-100, ImageNet, and COCO datasets, as well as various CNN and ViT teacher-student architecture combinations, demonstrate that the proposed ranking loss can be plugged into various baselines and enhance their performance.
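
A minimal sketch of a differentiable Kendall's τ surrogate of the sort described; the paper's exact loss may differ, and the tanh-based pairwise agreement below is an assumption chosen to keep the example self-contained.

```python
# Hedged sketch of a differentiable Kendall's tau surrogate between student
# and teacher logits. The paper's exact formulation may differ; pairwise
# channel orderings are compared here with a smooth tanh agreement score.
import torch

def soft_kendall_tau_loss(student: torch.Tensor, teacher: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    """student, teacher: (batch, num_classes) logits."""
    s_diff = student.unsqueeze(2) - student.unsqueeze(1)  # (B, C, C) pair gaps
    t_diff = teacher.unsqueeze(2) - teacher.unsqueeze(1)
    # Concordant channel pairs push the product toward +1, discordant to -1;
    # unlike KL, the score depends on orderings rather than channel scale.
    agreement = torch.tanh(s_diff / temperature) * torch.tanh(t_diff / temperature)
    return 1.0 - agreement.mean()

student = torch.randn(8, 100, requires_grad=True)
teacher = torch.randn(8, 100)
# Plug-and-play: add to an existing distillation loss, e.g.
# loss = kl_loss + rank_weight * soft_kendall_tau_loss(student, teacher)
soft_kendall_tau_loss(student, teacher).backward()
```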

779Diff-BBO: Diffusion-Based Inverse Modeling for Black-Box Optimization

[openreview] [pdf]

Abstract Black-box optimization (BBO) aims to optimize an objective function by iteratively querying a black-box oracle in a sample-efficient way. While prior studies focus on forward approaches to learn surrogates for the unknown objective function, they struggle with steering clear of out-of-distribution and invalid inputs. Recently, inverse modeling approaches that map the objective space to the design space with conditional diffusion models have demonstrated impressive capability in learning the data manifold. They have shown promising performance in offline BBO tasks. However, these approaches require a pre-collected dataset. How to design the acquisition function for inverse modeling to actively query new data remains an open question. In this work, we propose diffusion-based inverse modeling for black-box optimization (Diff-BBO), an inverse approach leveraging diffusion models for the online BBO problem. Instead of proposing candidates in the design space, Diff-BBO employs a novel acquisition function, Uncertainty-aware Exploration (UaE), to propose objective function values. Subsequently, we employ a conditional diffusion model to generate samples based on these proposed values within the design space. We demonstrate that using UaE results in optimal optimization outcomes, supported by both theoretical and empirical evidence.

780How does Your RL Agent Explore? An Optimal Transport Analysis of Occupancy Measure Trajectories

[openreview] [pdf]

Abstract The rising successes of RL are propelled by combining smart algorithmic strategies and deep architectures to optimize the distribution of returns and visitations over the state-action space. A quantitative framework to compare the learning processes of these eclectic RL algorithms is currently absent but desired in practice. We address this gap by representing the learning process of an RL algorithm as a sequence of policies generated during training, and then studying the policy trajectory induced in the manifold of occupancy measures. Using an optimal transport-based metric, we measure the length of the paths induced by the policy sequence yielded by an RL algorithm between an initial policy and a final optimal policy. Hence, we first define the Effort of Sequential Learning (ESL). ESL quantifies the relative distance that an RL algorithm travels compared to the shortest path from the initial to the optimal policy. Further, we connect the dynamics of policies in the occupancy measure space and regret, another metric to understand the suboptimality of an RL algorithm, by defining the Optimal Movement Ratio (OMR). OMR assesses the fraction of movements in the occupancy measure space that effectively reduce an analogue of regret. Finally, we derive approximation guarantees to estimate ESL and OMR with a finite number of samples and without access to an optimal policy. Through empirical analyses across various environments and algorithms, we demonstrate that ESL and OMR provide insights into the exploration processes of RL algorithms and the hardness of different tasks in discrete and continuous MDPs.

781NoisyTraj: Robust Trajectory Prediction with Noisy Observations

[openreview] [pdf]

Abstract Trajectory prediction aims to forecast an agent’s future trajectories based on its historical observed trajectories, which is a critical task for various applications such as autonomous driving, robotics, and surveillance systems. Most existing trajectory prediction methods assume that the observed trajectories collected for forecasting are clean. However, in real-world scenarios, noise is inevitably introduced into the observations due to errors from sensors, detection, and tracking processes, resulting in the collapse of existing approaches. Therefore, it is essential to perform robust trajectory prediction based on noisy observations, which is a more practical scenario. In this paper, we propose NoisyTraj, a noise-agnostic approach capable of tackling the problem of trajectory prediction with arbitrary types of noisy observations. Specifically, we put forward a mutual information-based mechanism to denoise the original noisy observations. This mechanism optimizes the produced trajectories to exhibit a pattern that closely resembles the clean trajectory pattern while deviating from the noisy one. Considering that the trajectory structure may be destroyed when optimizing mutual information alone, we introduce an additional reconstruction loss to preserve the structure information of the produced observed trajectories. Moreover, we further propose a ranking loss based on the intuitive idea that prediction performance using denoised trajectories should surpass that using the original noisy observations, thereby further enhancing performance. Because NoisyTraj does not rely on any specific module tailored to particular noise distributions, it can handle arbitrary types of noise in principle. Additionally, our proposed NoisyTraj can be easily integrated into existing trajectory prediction models. Extensive experiments conducted on the ETH/UCY and Stanford Drone (SDD) datasets demonstrate that NoisyTraj significantly improves the accuracy of trajectory prediction with noisy observations, compared to the baselines.

782Active Fine-Tuning of Generalist Policies

[openreview] [pdf]

Abstract Pre-trained generalist policies are rapidly gaining relevance in robot learning due to their promise of fast adaptation to novel, in-domain tasks. This adaptation often relies on collecting new demonstrations for a specific task of interest and applying imitation learning algorithms, such as behavioral cloning. However, as soon as several tasks need to be learned, we must decide: which tasks should be demonstrated, and how often? We study this multi-task problem and explore an interactive framework in which the agent adaptively selects the tasks to be demonstrated. We propose AMF (Active Multi-task Fine-tuning), an algorithm to maximize multi-task policy performance under a limited demonstration budget by collecting demonstrations yielding the largest information gain on the expert policy. We derive performance guarantees for AMF under regularity assumptions and demonstrate its empirical effectiveness to efficiently fine-tune neural policies in complex and high-dimensional environments.

783Scrutinize What We Ignore: Reining In Task Representation Shift Of Context-Based Offline Meta Reinforcement Learning

[openreview] [pdf]

Abstract Offline meta reinforcement learning (OMRL) has emerged as a promising approach for interaction avoidance and strong generalization performance by leveraging pre-collected data and meta-learning techniques. Previous context-based approaches predominantly rely on the intuition that alternating optimization between the context encoder and the policy can lead to performance improvements, as long as the context encoder follows the principle of maximizing the mutual information between the task variable $M$ and its latent representation $Z$ ($I(Z;M)$) while the policy adopts the standard offline reinforcement learning (RL) algorithms conditioning on the learned task representation. Despite promising results, the theoretical justification of performance improvements for such intuition remains underexplored. Inspired by the return discrepancy scheme in the model-based RL field, we find that the previous optimization framework can be linked with the general RL objective of maximizing the expected return, thereby explaining performance improvements. Furthermore, after scrutinizing this optimization framework, we find it ignores the variation of the task representation in the alternating optimization process, which weakens the condition necessary for monotonic performance improvements, and may therefore violate the monotonicity. We name this issue task representation shift and theoretically prove that the monotonic performance improvements can be guaranteed with appropriate context encoder updates. We use different settings to rein in the task representation shift on three widely adopted training objectives concerning maximizing $I(Z;M)$ across different data qualities. Empirical results show that reining in the task representation shift can indeed improve performance. Our work opens up a new avenue for OMRL, leading to a better understanding between the task representation and performance improvements.

784A Dual-Fusion Cognitive Diagnosis Framework for Open Student Learning Environments

[openreview] [pdf]

Abstract Cognitive diagnosis model (CDM) is a fundamental and upstream component in intelligent education. It aims to infer students’ mastery levels based on historical response logs. However, existing CDMs usually follow the ID-based embedding paradigm, which could often diminish the effectiveness of CDMs in open student learning environments. This is mainly because they can hardly directly infer new students’ mastery levels or utilize new exercises or knowledge without retraining. Textual semantic information, due to its unified feature space and easy accessibility, can help alleviate this issue. Unfortunately, directly incorporating semantic information may not benefit CDMs, since it does not capture response-relevant features and thus discards the individual characteristics of each student. To this end, this paper proposes a dual-fusion cognitive diagnosis framework (DFCD) to address the challenge of aligning two different modalities, i.e., textual semantic features and response-relevant features. Specifically, in DFCD, we first propose the exercise-refiner and concept-refiner to make the exercises and knowledge concepts more coherent and reasonable via large language models. Then, DFCD encodes the refined features using text embedding models to obtain the semantic information. For response-related features, we propose a novel response matrix to fully incorporate the information within the response logs. Finally, DFCD designs a dual-fusion module to merge the two modal features. The ultimate representations possess the capability of inference in open student learning environments and can also be plugged into existing CDMs. Extensive experiments across real-world datasets show that DFCD achieves superior performance by integrating different modalities and strong adaptability in open student learning environments.

785Almost sure convergence of stochastic Hamiltonian descent methods

[openreview] [pdf]

Abstract Gradient normalization and soft clipping are two popular techniques for tackling instability issues and improving convergence of stochastic gradient descent (SGD) with momentum. In this article, we study these types of methods through the lens of dissipative Hamiltonian systems. Gradient normalization and certain types of soft clipping algorithms can be seen as (stochastic) implicit-explicit Euler discretizations of dissipative Hamiltonian systems, where the kinetic energy function determines the type of clipping that is applied. We make use of dynamical systems theory to show in a unified way that all of these schemes converge to stationary points of the objective function, almost surely, in several different settings: (a) for $L$-smooth objective functions, when the variance of the stochastic gradients is possibly infinite; (b) under the $(L_0,L_1)$-smoothness assumption, for heavy-tailed noise with bounded variance; and (c) for $(L_0,L_1)$-smooth functions in the empirical risk minimization setting, when the variance is possibly infinite but the expectation is finite.
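
To illustrate the implicit-explicit Euler view, here is a hedged sketch in which the kinetic energy T(p) = sqrt(|p|^2 + eps) turns the parameter step into a normalized-momentum update; the objective and step sizes are illustrative assumptions, not the paper's experimental setup.

```python
# Hedged sketch of momentum SGD as an implicit-explicit Euler step of a
# dissipative Hamiltonian system. Choosing the kinetic energy
# T(p) = sqrt(|p|^2 + eps) makes the parameter step a normalized-momentum
# update, one of the scheme types the analysis covers.
import numpy as np

def hamiltonian_step(x, p, grad_fn, lr=0.1, mu=0.9, eps=1e-8):
    p = mu * p - grad_fn(x)                   # momentum update (explicit part)
    kinetic_grad = p / np.sqrt(p @ p + eps)   # grad T(p): normalized momentum
    x = x + lr * kinetic_grad                 # parameter step along grad T(p)
    return x, p

grad = lambda x: 2.0 * x                      # gradient of f(x) = |x|^2
x, p = np.array([3.0, -2.0]), np.zeros(2)
for _ in range(200):
    x, p = hamiltonian_step(x, p, grad)
print(x)  # ends near the stationary point; the fixed step bounds final accuracy
```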

786Leveraging Variable Sparsity to Refine Pareto Stationarity in Multi-Objective Optimization

[openreview] [pdf]

Abstract Gradient-based multi-objective optimization (MOO) is essential in modern machine learning, with applications in e.g., multi-task learning, federated learning, algorithmic fairness and reinforcement learning. In this work, we first reveal some limitations of Pareto stationarity, a widely accepted first-order condition for Pareto optimality, in the presence of sparse function-variable structures. Next, to account for such sparsity, we propose a novel solution concept termed Refined Pareto Stationarity (RPS), which we prove is always sandwiched between Pareto optimality and Pareto stationarity. We give an efficient partitioning algorithm to automatically mine the function-variable dependency and substantially trim non-optimal Pareto stationary solutions. Then, we show that gradient-based descent algorithms in MOO can be enhanced with our refined partitioning. In particular, we propose Multiple Gradient Descent Algorithm with Refined Partition (RP-MGDA) as an example method that converges to RPS, while still enjoying a similar per-step complexity and convergence rate. Lastly, we validate our approach through experiments on both synthetic examples and realistic application scenarios where distinct function-variable dependency structures appear. Our results highlight the importance of exploiting function-variable structure in gradient-based MOO, and provide a seamless enhancement to existing approaches.

787Open-Set Domain Adaptation Under Background Distribution Shift: Challenges and A Provably Efficient Solution

[openreview] [pdf]

Abstract In Open-Set Domain Adaptation (OSDA) we wish to perform classification in a target domain which contains a novel class along with $k$ non-novel classes. This work formally studies OSDA under the assumption that classes are separable, and the supports of source and target domains coincide, while other aspects of the distribution may change. We develop a simple and scalable method that attains robustness to distribution shift and is guaranteed to solve the problem, while showing that it cannot be solved under weaker conditions that have been studied for OSDA in the past, particularly in the presence of covariate shift. We formally define the realistic assumptions within the scope of the OSDA problem that the previous literature has either overlooked or not explicitly addressed. In a thorough empirical evaluation on both image and text data, we observe that existing OSDA methods are not robust to the distribution shifts we consider. The results demonstrate the efficacy of joint representation learning for classification of known classes and detection of novel ones using principled methods. We find that optimizing these two objectives in unison leads to mutual improvements in task performance, contrary to what might be expected when the objectives are considered independently. Our rigorous empirical study also examines how OSDA performance under distribution shift is affected by parameters of the problem such as the size of the novel class. Taken together, our observations emphasize the importance of formalizing the assumptions under which OSDA methods operate and of developing appropriate methodologies capable of scaling with large datasets and models for different scenarios of OSDA.

788Transformers versus LSTMs for electronic trading

[openreview] [pdf]

Abstract The rapid advancement of artificial intelligence has seen widespread application of long short-term memory (LSTM), a type of recurrent neural network (RNN), in time series forecasting. Despite the success of Transformers in natural language processing (NLP), which prompted interest in their efficacy for time series prediction, their application in financial time series forecasting is less explored compared to the dominant LSTM models. This study investigates whether Transformer-based models can outperform LSTMs in financial time series forecasting. It involves a comparative analysis of various LSTM-based and Transformer-based models on multiple financial prediction tasks using high-frequency limit order book data. A novel LSTM-based model named DLSTM is introduced alongside a newly designed Transformer-based model tailored for financial predictions. The findings indicate that Transformer-based models exhibit only a marginal advantage in predicting absolute price sequences, whereas LSTM-based models demonstrate superior and more consistent performance in predicting differential sequences such as price differences and movements.

789Unified Framework for Causal Discovery and Long-term Forecasting in Non-stationary Environments

[openreview] [pdf]

Abstract Non-stationary data is prevalent in various real-world domains such as climate science, economics, and neuroscience, presenting significant challenges for tasks like forecasting and causal discovery from observational data. Existing approaches often operate under the assumption that the data is stationary. In this work, we introduce a unified framework that combines long-term forecasting and causal discovery with non-linear relations in a non-stationary setting. Specifically, we assume that the nonlinear causal relations in the observed space can be transformed into linear relations in the latent space via projections. In addition, we model the non-stationarity in the system as arising from time-varying causal relations. The proposed model demonstrates that adopting a causal perspective for long-term forecasting not only addresses the limitations of each task but also makes the causal process identifiable, enhances interpretability, and provides more reliable predictions. Moreover, our approach reformulates causal discovery into a scalable, non-parametric deep learning problem. Through experiments on both synthetic and real-world datasets, we show that our framework outperforms baseline methods in both forecasting and causal discovery, underscoring the benefits of this integrated approach.

790TLXML: Task-Level Explanation of Meta-Learning via Influence Functions

[openreview] [pdf]

Abstract The scheme of adaptation via meta-learning is seen as an ingredient for solving the problem of data shortage or distribution shift in real-world applications, but it also brings the new risk of inappropriate updates of the model in the user environment, which increases the demand for explainability. Among the various types of XAI methods, establishing a method of explanation based on past experience in meta-learning requires special consideration due to its bi-level structure of training, which has been left unexplored. In this work, we propose influence functions for explaining meta-learning that measure the sensitivities of training tasks to adaptation and inference. We also argue that the approximation of the Hessian using the Gauss-Newton matrix resolves computational barriers peculiar to meta-learning. We demonstrate the adequacy of the method through experiments on task distinction and task distribution distinction using image classification tasks with MAML and Prototypical Network.

791No Preference Left Behind: Group Distributional Preference Optimization

[openreview] [pdf]

Abstract Preferences within a group of people are not uniform but follow a distribution. While existing alignment methods like Direct Preference Optimization (DPO) attempt to steer models to reflect human preferences, they struggle to capture the distributional pluralistic preferences within a group. These methods often skew toward dominant preferences, overlooking the diversity of opinions, especially when conflicting preferences arise. To address this issue, we propose Group Distribution Preference Optimization (GDPO), a novel framework that aligns language models with the distribution of preferences within a group by incorporating the concept of beliefs that shape individual preferences. GDPO calibrates a language model using statistical estimation of the group’s belief distribution and aligns the model with belief-conditioned preferences, offering a more inclusive alignment framework than traditional methods. In experiments using both synthetic controllable opinion generation and real-world movie review datasets, we show that DPO fails to align with the targeted belief distributions, while GDPO consistently reduces this alignment gap during training. Additionally, our evaluation metrics demonstrate that GDPO outperforms existing approaches in aligning with group distributional preferences, marking a significant advance in pluralistic alignment.

792Cluster-Segregate-Perturb (CSP): A Model-agnostic Explainability Pipeline for Spatiotemporal Land Surface Forecasting Models

[openreview] [pdf]

Abstract Satellite images are increasingly valuable for modeling regional climate change. Earth surface forecasting is one task that combines satellite imagery and meteorological data to understand how climate evolves over time. However, understanding the complex relationship between meteorological variables and land surface changes remains a challenge. Our paper introduces a pipeline that integrates principles from perturbation-based techniques like LIME and global explainability methods like PDP, addressing the limitations of these techniques in high-dimensional spatiotemporal models. This pipeline facilitates analyses such as marginal sensitivity, correlation, and lag analysis for complex land surface forecasting models. Using ConvLSTM for surface forecasting, we analyzed the influence of variables like temperature, pressure, and precipitation on the NDVI of the surface predictions. Our study on the EarthNet2021 dataset (which primarily consists of samples from the European Alps region, collected during the spring to fall seasons) revealed that precipitation had the greatest impact, followed by temperature, while pressure had little to no direct effect on NDVI. Additionally, interesting nonlinear correlations between meteorological variables and NDVI were uncovered.

793HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models

[openreview] [pdf]

Abstract Safety guard models that detect malicious queries aimed at large language models (LLMs) are essential for ensuring the secure and responsible deployment of LLMs in real-world applications. However, deploying existing safety guard models with billions of parameters alongside LLMs on mobile devices is impractical due to substantial memory requirements and latency. To reduce this cost, we distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels. Due to the limited diversity of harmful instructions in the existing labeled dataset, naively distilled models tend to underperform compared to larger models. To bridge the gap between small and large models, we propose HarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions. Given a prompt such as, “Make a single harmful instruction prompt that would elicit offensive content”, we add an affirmative prefix (e.g., “I have an idea for a prompt:”) to the LLM’s response. This encourages the LLM to continue generating the rest of the response, leading to sampling harmful instructions. Another LLM generates a response to the harmful instruction, and the teacher model labels the instruction-response pair. We empirically show that our HarmAug outperforms other relevant baselines. Moreover, a 435-million-parameter safety guard model trained with HarmAug achieves an F1 score comparable to larger models with over 7 billion parameters, and even outperforms them in AUPRC, while operating at less than 25% of their computational cost. Our code, safety guard model, and synthetic dataset are publicly available.
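
The affirmative-prefix trick is easy to express as a sketch; `chat` below is a hypothetical stand-in for an LLM client, and only the quoted prompt wording and the prefix idea come from the abstract.

```python
# Hedged sketch of the HarmAug augmentation loop described above. `chat` is
# a hypothetical callable wrapping an LLM API; replace with a real client.
from typing import Callable

def sample_harmful_instruction(chat: Callable[[str], str]) -> str:
    prompt = (
        "Make a single harmful instruction prompt that would elicit "
        "offensive content.\n"
        "I have an idea for a prompt:"  # affirmative prefix continuation
    )
    return chat(prompt)

def make_labeled_pair(chat: Callable[[str], str],
                      teacher_score: Callable[[str, str], float]):
    instruction = sample_harmful_instruction(chat)
    response = chat(instruction)  # a second LLM answers the instruction
    # The teacher safety guard model labels the resulting pair.
    return instruction, response, teacher_score(instruction, response)
```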

794Preference Elicitation for Offline Reinforcement Learning

[openreview] [pdf]

Abstract Applying reinforcement learning (RL) to real-world problems is often made challenging by the inability to interact with the environment and the difficulty of designing reward functions. Offline RL addresses the first challenge by considering access to an offline dataset of environment interactions labeled by the reward function. In contrast, Preference-based RL does not assume access to the reward function and learns it from preferences, but typically requires online interaction with the environment. We bridge the gap between these frameworks by exploring efficient methods for acquiring preference feedback in a fully offline setup. We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm, which leverages a learned environment model to elicit preference feedback on simulated rollouts. Drawing on insights from both the offline RL and the preference-based RL literature, our algorithm employs a pessimistic approach for out-of-distribution data, and an optimistic approach for acquiring informative preferences about the optimal policy. We provide theoretical guarantees regarding the sample complexity of our approach, dependent on how well the offline data covers the optimal policy. Finally, we demonstrate the empirical performance of Sim-OPRL in various environments.

795Flashback: Understanding and Mitigating Forgetting in Federated Learning

[openreview] [pdf]

Abstract In Federated Learning (FL), forgetting, or the loss of knowledge across rounds, hampers algorithm convergence, especially in the presence of severe data heterogeneity among clients. This study explores the nuances of this issue, emphasizing the critical role of forgetting in FL’s inefficient learning within heterogeneous data contexts. Knowledge loss occurs in both client-local updates and server-side aggregation steps; addressing one without the other fails to mitigate forgetting. We introduce a metric to measure forgetting granularly, ensuring distinct recognition amid new knowledge acquisition. Based on this, we propose Flashback, a novel FL algorithm with a dynamic distillation approach that regularizes the local models and effectively aggregates their knowledge. The results from extensive experimentation across different benchmarks show that Flashback mitigates forgetting and outperforms other state-of-the-art methods, reaching target accuracy in 6 to 16 rounds and converging up to 27× faster.
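
A hedged sketch of measuring forgetting granularly in the spirit described above (the paper's exact metric is not reproduced): knowledge lost in a round is tracked separately from knowledge newly acquired, which a net accuracy change would conflate.

```python
# Hedged sketch: per-example forgetting vs. learning between two FL rounds.
# The paper's exact metric may differ; this separates the two effects that a
# single accuracy delta would hide.
import numpy as np

def forgetting_and_learning(correct_prev: np.ndarray,
                            correct_curr: np.ndarray) -> tuple[float, float]:
    """Boolean per-example correctness before/after a round."""
    forgotten = np.mean(correct_prev & ~correct_curr)  # known -> lost
    learned = np.mean(~correct_prev & correct_curr)    # unknown -> gained
    return float(forgotten), float(learned)

prev = np.array([1, 1, 1, 0, 0], dtype=bool)
curr = np.array([1, 0, 1, 1, 0], dtype=bool)
f, l = forgetting_and_learning(prev, curr)
print(f, l)  # 0.2 forgotten, 0.2 learned: net accuracy alone hides both effects
```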

796Auction-Based Regulation for Artificial Intelligence

[openreview] [pdf]

Abstract In an era of “moving fast and breaking things”, regulators have moved slowly to pick up the safety, bias, and legal pieces left in the wake of broken Artificial Intelligence (AI) deployment. Since AI models, such as large language models, are able to push misinformation and stoke division within our society, it is imperative for regulators to employ a framework that mitigates these dangers and ensures user safety. While there is much-warranted discussion about how to address the safety, bias, and legal woes of state-of-the-art AI models, rigorous and realistic mathematical frameworks to regulate AI safety are lacking. We take on this challenge, proposing an auction-based regulatory mechanism that provably incentivizes model-building agents (i) to deploy safer models and (ii) to participate in the regulation process. We provably guarantee, via derived Nash Equilibria, that each participating agent’s best strategy is to submit a model safer than a prescribed minimum-safety threshold. Empirical results show that our regulatory auction boosts safety and participation rates by 20% and 15% respectively, outperforming simple regulatory frameworks that merely enforce minimum safety standards.

797Fair Anomaly Detection For Imbalanced Groups

[openreview] [pdf]

Abstract Anomaly detection (AD) has been widely studied for decades in many real-world applications, including fraud detection in finance and intrusion detection in cybersecurity. Due to the imbalanced nature between protected and unprotected groups and the imbalanced distributions of normal examples and anomalies, the learning objectives of most existing anomaly detection methods tend to solely concentrate on the dominating unprotected group. Thus, many researchers have recognized the significance of ensuring model fairness in anomaly detection. However, the existing fair anomaly detection methods tend to erroneously label most normal examples from the protected group as anomalies in the imbalanced scenario where the unprotected group is more abundant than the protected group. This phenomenon is caused by the improper design of learning objectives, which statistically focus on learning the frequent patterns (i.e., the unprotected group) while overlooking the under-represented patterns (i.e., the protected group). To address these issues, we propose FADIG, a fairness-aware anomaly detection method targeting the imbalanced scenario. It consists of a fairness-aware contrastive learning module and a rebalancing autoencoder module to ensure fairness and handle the imbalanced data issue, respectively. Moreover, we provide theoretical analysis that shows our proposed contrastive learning regularization guarantees group fairness. Empirical studies demonstrate the effectiveness and efficiency of FADIG across multiple real-world datasets.

798Understanding and Mitigating Distribution Shifts for Machine Learning Force Fields

[openreview] [pdf]

Abstract Machine Learning Force Fields (MLFFs) are a promising alternative to expensive ab initio quantum mechanical molecular simulations. Given the diversity of chemical spaces that are of interest and the cost of generating new data, it is important to understand how MLFFs generalize beyond their training distributions. Our diagnostic experiments on real-world datasets reveal common distribution shifts that pose significant challenges, including for large foundation models trained on extensive datasets. Based on these observations, we hypothesize that current supervised training methods inadequately regularize MLFFs, resulting in overfitting and learning poor representations of out-of-distribution systems. We therefore propose two new methods as initial steps for mitigating distribution shifts for MLFFs. Our methods focus on test-time refinement strategies that incur minimal computational cost. The first strategy, based on spectral graph theory, modifies the edges of test graphs to align with graph structures seen during training. It can be applied to any existing pre-trained model to mitigate connectivity distribution shifts. Our second strategy improves representations for out-of-distribution systems at test-time by taking gradient steps using an auxiliary objective. We demonstrate that our test-time refinement strategies can reduce force errors by an order of magnitude on out-of-distribution systems, suggesting that MLFFs are capable of modeling diverse chemical spaces, but are not being effectively trained to do so. Our experiments establish clear benchmarks for evaluating the generalization capabilities of the next generation of MLFFs.

799Innovative Thinking, Infinite Humor: Humor Research of Large Language Models through Structured Thought Leaps

[openreview] [pdf]

Abstract Humor is a culturally nuanced aspect of human language that presents challenges for understanding and generation, requiring participants to possess good creativity and strong associative thinking. Similar to reasoning tasks like solving math problems, humor generation requires continuous reflection and revision to foster creative thinking, rather than relying on a sudden flash of inspiration like the Creative Leap-of-Thought (CLoT) paradigm. Although CLoT can realize the ability of remote association generation, this paradigm fails to emphasize the importance of rationales between those seemingly unrelated concepts. Therefore, in this paper, we propose a systematic way of thinking about generating humor and, based on it, build the Creative Leap of Structured Thought (CLoST) framework. First, a reward model is necessary to enable error correction, since there is currently no expert model of humor and no usable rule to determine whether a piece of content is humorous. Judgement-oriented instructions are designed to improve the capability of the model, and we also propose an open-domain instruction evolutionary method to fully unleash its potential. Then, through reinforcement learning, the model learns to hone the rationales of its thought chain and refine the strategies it uses. Thus, it learns to recognize and correct its mistakes, and finally generate the most humorous and creative answer. These findings deepen our understanding of the creative capabilities of LLMs and provide ways to enhance LLMs’ creative abilities for cross-domain innovative applications.

800Revisiting Large-Scale Non-convex Distributionally Robust Optimization

[openreview] [pdf]

Abstract Distributionally robust optimization (DRO) is a powerful technique to train robust machine learning models that perform well under distribution shifts. Compared with empirical risk minimization (ERM), DRO optimizes the expected loss under the worst-case distribution in an uncertainty set of distributions. This paper revisits the important problem of DRO with non-convex smooth loss functions. For this problem, Jin et al. (2021) showed that its dual problem satisfies the generalized $(L_0, L_1)$-smoothness condition and that the gradient noise satisfies the affine variance condition, designed an algorithm of mini-batch normalized gradient descent with momentum, and proved its convergence and complexity. In this paper, we show that the dual problem and the gradient noise satisfy simpler yet more precise partially generalized smoothness and partially affine variance conditions by studying the optimization variable and dual variable separately, which further yields much simpler algorithm design and convergence analysis. We develop a double stochastic gradient descent with clipping (D-SGD-C) algorithm that converges to an $\epsilon$-stationary point with $\mathcal{O}(\epsilon^{-4})$ gradient complexity, which matches the results in Jin et al. (2021). Our algorithm does not need to use momentum, and the proof is much simpler, thanks to the more precise characterization of partially generalized smoothness and partially affine variance noise. We further design a variance-reduced method that achieves a lower gradient complexity of $\mathcal{O}(\epsilon^{-3})$. Our theoretical results and insights are further verified numerically on a number of tasks, and our algorithms outperform the existing DRO method (Jin et al., 2021).
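
A minimal, hedged sketch of one D-SGD-C-style step: both the model parameters and the dual variable take clipped stochastic gradient steps. The actual dual objective, step sizes, clipping thresholds, and the positivity floor below are illustrative assumptions, not the paper's exact algorithm.

```python
# Hedged sketch of a double-SGD-with-clipping step for a DRO dual problem.
import numpy as np

def clip(g: np.ndarray, max_norm: float) -> np.ndarray:
    """Rescale g so its norm never exceeds max_norm."""
    norm = np.linalg.norm(g)
    return g if norm <= max_norm else g * (max_norm / norm)

def dsgd_c_step(w, lam, grad_w_fn, grad_lam_fn,
                lr_w=1e-2, lr_lam=1e-2, c=1.0):
    w = w - lr_w * clip(grad_w_fn(w, lam), c)              # primal (model) step
    lam_step = float(np.clip(grad_lam_fn(w, lam), -c, c))  # scalar dual gradient
    lam = max(1e-6, lam - lr_lam * lam_step)               # keep dual feasible
    return w, lam
```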

801Improving Offline-to-Online Reinforcement Learning with Q Conditioned State Entropy Exploration

[openreview] [pdf]

Abstract Studying how to fine-tune policies pre-trained with offline reinforcement learning (RL) is profoundly significant for enhancing the sample efficiency of RL algorithms. However, directly fine-tuning pre-trained policies often results in sub-optimal performance. This is primarily due to the distribution shift between the offline pre-training and online fine-tuning stages. Specifically, the distribution shift limits the acquisition of effective online samples, ultimately impacting the online fine-tuning performance. To narrow down the distribution shift between the offline and online stages, we propose Q-conditioned state entropy (QCSE) as an intrinsic reward. Specifically, QCSE maximizes the state entropy of all samples individually, considering their respective Q values. This approach encourages exploration of low-frequency samples while penalizing high-frequency ones, and implicitly achieves State Marginal Matching (SMM), thereby ensuring optimal performance and resolving the asymptotic sub-optimality of constraint-based approaches. Additionally, QCSE can seamlessly integrate into various RL algorithms, enhancing online fine-tuning performance. To validate our claim, we conduct extensive experiments and observe significant improvements with QCSE (about 10.9% for CQL and 8% for Cal-QL). Furthermore, we extend our experiments to other algorithms, confirming the generality of QCSE.
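
A hedged sketch of a Q-conditioned state-entropy bonus: a particle-based k-NN estimate stands in for state entropy (a common estimator, though not necessarily the paper's), and the exponential Q-weighting is an illustrative assumption rather than the exact QCSE formulation.

```python
# Hedged sketch of a Q-conditioned state-entropy intrinsic reward.
import numpy as np

def qcse_bonus(states: np.ndarray, q_values: np.ndarray, k: int = 5) -> np.ndarray:
    """states: (n, d) visited states; q_values: (n,) critic estimates."""
    dists = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    knn = np.sort(dists, axis=1)[:, k]   # distance to each point's k-th neighbor
    entropy_proxy = np.log(1.0 + knn)    # large in sparsely visited regions
    weights = np.exp(q_values - q_values.max())  # favor high-value rare states
    return entropy_proxy * weights

states = np.random.randn(256, 4)
bonus = qcse_bonus(states, q_values=np.random.randn(256))
# Added to the environment reward during fine-tuning: r_total = r + beta * bonus
```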

802Out-of-distribution Generalization for Total Variation based Invariant Risk Minimization

[openreview] [pdf]

Abstract Invariant risk minimization is an important general machine learning framework that has recently been interpreted as a total variation model (IRM-TV). However, how to improve out-of-distribution (OOD) generalization in the IRM-TV setting remains unsolved. In this paper, we propose a novel OOD generalization approach for IRM-TV, named OOD-TV-IRM, based on its theoretical analysis. The key idea is to deploy an autonomous TV penalty that depends on the invariant feature extractor. We construct the autonomous TV penalty using a neural network with another set of parameters, which can be learned via an adversarial scheme against the parameters of the invariant feature extractor. Experimental results show that OOD-TV-IRM outperforms IRM-TV in most situations.

803Prompt Optimization with Human Feedback

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated remarkable performance in various tasks. However, the performance of LLMs heavily depends on the input prompt. This has given rise to a number of recent works on prompt optimization. However, the previous works often require the availability of a numeric score to assess the quality of every prompt. Unfortunately, when a human user interacts with a black-box LLM, it is often infeasible and unreliable to attain such a score. Instead, it is usually significantly easier and more reliable to obtain preference feedback from a human user, i.e., showing the user the responses generated from a pair of prompts and asking the user which one is preferred. Therefore, in this paper, we study the problem of prompt optimization with human feedback (POHF), in which we aim to optimize the prompt for a black-box LLM using only human preference feedback. By drawing inspiration from dueling bandits, we design a theoretically principled strategy to select a pair of prompts to query for preference feedback in every iteration, and hence introduce our algorithm named automated POHF (APOHF). We apply our APOHF algorithm to a variety of tasks, including optimizing user instructions, prompt optimization for text-to-image generative models, and response optimization with human feedback (i.e., further refining the response using a variant of our APOHF). The results demonstrate that our APOHF can efficiently find a good prompt using a small number of preference feedback instances.

804Dreamguider: Improved Training free Diffusion-based Conditional Generation

[openreview] [pdf]

Abstract Diffusion models have emerged as a formidable tool for training-free conditional generation. However, a key hurdle in inference-time guidance techniques is the need for compute-heavy backpropagation through the diffusion network for estimating the guidance direction. Moreover, these techniques often require handcrafted parameter tuning on a case-by-case basis. Although some recent works have introduced minimal compute methods for linear inverse problems, a generic lightweight guidance solution to both linear and non-linear guidance problems is still missing. To this end, we propose Dreamguider, a method that enables inference-time guidance without compute-heavy backpropagation through the diffusion network. The key idea is to regulate the gradient flow through a time-varying factor. Moreover, we propose an empirical guidance scale that works for a wide variety of tasks, hence removing the need for handcrafted parameter tuning. We further introduce an effective lightweight augmentation strategy that significantly boosts the performance during inference-time guidance. We present experiments using Dreamguider on multiple tasks across multiple datasets and models to show the effectiveness of the proposed modules. To facilitate further research, we will make the code public after the review process.

805Mirror Descent Actor Critic via Bounded Advantage Learning

[openreview] [pdf]

Abstract Regularization is a core component of recent Reinforcement Learning (RL) algorithms. Mirror Descent Value Iteration (MDVI) uses both Kullback-Leibler divergence and entropy as regularizers in its value and policy updates. Despite its empirical success in discrete action domains and strong theoretical guarantees, the performance improvement of a MDVI-based method over entropy-only-regularized RL is limited in continuous action domains. In this study, we propose Mirror Descent Actor Critic (MDAC) as an actor-critic style instantiation of MDVI for continuous action domains, and show that its empirical performance is significantly boosted by bounding the values of the actor’s log-density terms in the critic’s loss function. Further, we relate MDAC to Advantage Learning by recalling that the actor’s log-probability is equal to the regularized advantage function in tabular cases, and theoretically show that the error of optimal policy misspecification is decreased by bounding the advantage terms.
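
A hedged sketch of the bounding idea: the actor's log-density terms are clamped where they enter the critic's loss. The target below is illustrative only and not the exact MDAC objective; the coefficients and names are assumptions.

```python
# Hedged sketch: clamping the actor's log-density terms in a critic target.
import torch

def bounded_critic_target(reward, next_q, log_pi, log_pi_prev,
                          gamma=0.99, alpha=0.2, bound=1.0):
    # In tabular cases the actor's log-probability equals the regularized
    # advantage, so clamping it bounds the advantage-like terms.
    log_pi = torch.clamp(log_pi, -bound, bound)
    log_pi_prev = torch.clamp(log_pi_prev, -bound, bound)
    # Entropy- and KL-style regularizers enter through the bounded terms.
    return reward + gamma * (next_q - alpha * log_pi) + alpha * log_pi_prev
```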

[openreview] [pdf]

Abstract Transformer models have achieved remarkable results in the field of Natural Language Processing (NLP) with the introduction of breakthrough large language models like GPT and LLaMA recently. Motivated by their ability to capture long-range dependencies, researchers have successfully adapted these models to the task of time series forecasting. However, despite their potential, the effectiveness of applying these pre-trained time series transformer models in the target domain is limited due to the need for hyper-parameter optimisation to match the characteristics of the target domain. This paper presents a novel algorithm that uses parameter-efficient fine-tuning such as Low Rank Adaptation (LoRA) coupled with Limited Discrepancy Search (LDS) to efficiently auto fine-tune pre-trained time series transformers for a given target domain. Our approach helps in making informed design choices involving LoRA tunable hyper-parameters with strong performance-cost trade-offs that are highly transferable across different target domains. Our experiments demonstrate that autotune efficiently identifies the optimal configuration of LoRA hyper-parameters, achieving an average MASE improvement of 5.21% across all datasets and 4.76% for out-of-domain datasets compared to zero-shot pre-trained models, with improvements as high as 20.59% for one of the out-of-domain datasets.

807Understanding Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing

[openreview] [pdf]

Abstract Structured State Space Models (SSMs) have emerged as alternatives to transformers, addressing the challenges of processing long sequences. While SSMs are often regarded as effective in capturing long-term dependencies, we theoretically demonstrate that they suffer from a strong recency bias. Our empirical findings reveal that this bias impairs the models’ ability to recall distant information and introduces robustness issues. We conducted scaling experiments and discovered that deeper structures in SSMs facilitate the learning of long contexts. However, our theoretical analysis reveals that as SSMs increase in depth, they exhibit a tendency toward over-smoothing, resulting in token representations becoming increasingly indistinguishable. This over-smoothing phenomenon ultimately constrains the scalability of SSMs to achieve improved performance. Collectively, these findings highlight important limitations of SSMs and underscore the need for further research to address these challenges in long-range sequence modeling.

808Evaluating and Explaining the Severity of Distribution Shifts: Illustration with Tabular Text Classification

[openreview] [pdf]

Abstract After deploying a machine learning model, distribution shifts may emerge in real-world data. When dealing with unlabeled data, it can be challenging to accurately assess the impact of these drifts on the model’s performance, for any type and intensity of shift. In that case, decisions such as updating the model for every benign shift would not be cost-efficient. In this paper, we introduce the Error Classifier, an error assessment method that addresses two tasks: unsupervised performance estimation and error detection on out-of-distribution data. The Error Classifier computes the probability that the model will fail based on detected fault patterns. Further, we employ a sampling-based approximation of Shapley values, with the Error Classifier as the value function, in order to explain why a shift is predicted as severe, in terms of feature values. As explanation methods can sometimes disagree, we suggest evaluating the consistency of explanations produced by our technique and by other methods. We focus on classification and illustrate the relevance of our method in a bimodal context, on tabular datasets with text fields. We measure our method against a selection of 15 baselines from various domains, on 7 datasets with a variety of shifts, and 2 multimodal fusion strategies for the classification models. Lastly, we show the usefulness of our explanation algorithm on instances affected by various types of shifts.

809Incorporating Visual Correspondence into Diffusion Model for Visual Try-On

[openreview] [pdf]

Abstract Diffusion models have shown preliminary success in the virtual try-on (VTON) task. The typical dual-branch architecture comprises two UNets for implicit garment deformation and synthesized image generation respectively, and has emerged as the recipe for the VTON task. Nevertheless, it remains challenging to preserve the shape and every detail of the given garment due to the intrinsic stochasticity of the diffusion model. To alleviate this issue, we propose to explicitly capitalize on visual correspondence as the prior to tame the diffusion process, instead of simply feeding the whole garment into the UNet as the appearance reference. Specifically, we interpret the fine-grained appearance and texture details as a set of structured semantic points, and match the semantic points rooted in the garment to the ones over the target person through local flow warping. Such 2D points are then augmented into 3D-aware cues with the depth/normal map of the target person. The correspondence mimics the way of putting clothing on a human body, and the 3D-aware cues act as semantic point matching to supervise diffusion model training. A point-focused diffusion loss is further devised to take full advantage of semantic point matching. Extensive experiments demonstrate strong garment detail preservation of our approach, evidenced by state-of-the-art VTON performance on both the VITON-HD and DressCode datasets.

810Semantic-Aware Diffusion Model for Sequential Recommendation

[openreview] [pdf]

Abstract Sequential recommendation aims to predict the next click for a particular user based on their historical interacted item sequences. Recently, diffusion-based methods have achieved state-of-the-art performance in sequential recommendation. However, they fail to effectively utilize the rich semantic information embedded in items during the diffusion process to accurately guide the generation, leading to sub-optimal results. To address this limitation, we designed SDRec, a Semantic-aware Diffusion model for sequential Recommendation. Our model introduces a novel architecture, the Semantic Fusion Layer, which leverages the embedding table from the encoder to incorporate item semantics into the diffusion process through an attention mechanism. Together with the well-designed contrastive and generative losses, SDRec effectively utilizes the item semantics in the diffusion model, unleashing the potential of sequential recommendation. Our experiments show that SDRec has over 10% relative gain with superior efficiency compared with existing methods.

811Differentiable Solver Search for fast diffusion sampling

[openreview] [pdf]

Abstract Diffusion-based models have demonstrated remarkable generation quality but at the cost of numerous function evaluations. Recently, advanced ODE-based solvers have been developed to mitigate the substantial computational demands of reverse-diffusion solving under limited sampling steps. However, these solvers, heavily inspired by Adams-like multistep methods, rely solely on t-related Lagrange interpolation. We show that t-related Lagrange interpolation is suboptimal and identify a compact search space comprising timesteps and solver coefficients. Building on our analysis, we propose a novel differentiable solver search algorithm to identify the optimal solver. Equipped with the searched solver, our rectified flow models, SiT-XL/2 and FlowDCN-XL/2, achieve FID scores of 2.40 and 2.35, respectively, on ImageNet $256\times256$ with only 10 steps. Meanwhile, our DDPM model, DiT-XL/2, reaches an FID score of 2.33 with only 10 steps. Notably, our searched solver outperforms traditional solvers by a significant margin. Moreover, our searched solver demonstrates its generality across various model architectures, resolutions, and model sizes.
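
A hedged sketch of what a differentiable solver search could look like: learnable timesteps and per-step combination coefficients over cached model outputs, optimized by gradient descent against a reference sampler. The exact parameterization and training objective in the paper may differ; `model` is a placeholder velocity/score network.

```python
import torch

steps = 10
# Learnable interior timesteps and per-step combination coefficients.
raw_t = torch.nn.Parameter(torch.linspace(1.0, 0.0, steps + 1)[1:-1].clone())
coeffs = torch.nn.Parameter(torch.zeros(steps, steps))  # coeffs[i, :i+1] used

def searched_sampler(model, noise):
    ts = torch.cat([torch.ones(1),
                    raw_t.clamp(0, 1).sort(descending=True).values,
                    torch.zeros(1)])
    x, history = noise, []
    for i in range(steps):
        history.append(model(x, ts[i]))
        # Learned linear combination of all cached outputs (Adams-like,
        # but with free coefficients instead of Lagrange interpolation).
        w = torch.softmax(coeffs[i, : i + 1], dim=0)
        direction = sum(wj * h for wj, h in zip(w, history))
        x = x + (ts[i + 1] - ts[i]) * direction
    return x

# Demo with a stand-in velocity field; real training would minimize the gap
# to a many-step reference sampler, e.g.:
#   opt = torch.optim.Adam([raw_t, coeffs], lr=1e-2)
#   loss = (searched_sampler(model, z) - reference_sample(z)).pow(2).mean()
dummy = lambda x, t: -x
print(searched_sampler(dummy, torch.randn(2, 3)).shape)
```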

812Distilling Reinforcement Learning into Single-Batch Datasets

[openreview] [pdf]

Abstract Dataset distillation compresses a large dataset into a small, often one-batch, synthetic dataset such that learning on the synthetic dataset approximates learning on the large dataset. Training on the distilled dataset can be performed in as little as one step of gradient descent. We demonstrate that distillation is generalizable to different tasks by distilling reinforcement learning environments into one-batch supervised learning datasets. This demonstrates not only distillation’s ability to compress a reinforcement learning task but also its ability to transform one learning modality (reinforcement learning) into another (supervised learning). We present a novel extension of proximal policy optimization for meta-learning and use it in distillation of both a multi-dimensional extension of the classic cart-pole problem and several Atari games. We demonstrate distillation’s ability to compress complex RL environments into one-step supervised learning, explore RL distillation’s generalizability across learner architectures, and demonstrate distilling an environment into the smallest-possible synthetic dataset.
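
To make the one-step idea concrete, here is a minimal sketch of one-batch distillation: a synthetic batch is optimized so that a single inner gradient step on it fits a real task. In the paper the outer signal comes from an RL objective via a PPO extension; a toy supervised regression task stands in for it here.

```python
import torch

torch.manual_seed(0)
X_real = torch.randn(256, 4)
y_real = X_real @ torch.tensor([1.0, -2.0, 0.5, 0.0]) + 0.1 * torch.randn(256)

X_syn = torch.randn(8, 4, requires_grad=True)   # the distilled "dataset"
y_syn = torch.randn(8, requires_grad=True)
inner_lr = 0.5

opt = torch.optim.Adam([X_syn, y_syn], lr=1e-2)
for _ in range(2000):
    w0 = torch.zeros(4, requires_grad=True)      # fresh learner (zero-init here)
    inner_loss = ((X_syn @ w0 - y_syn) ** 2).mean()
    (g,) = torch.autograd.grad(inner_loss, w0, create_graph=True)
    w1 = w0 - inner_lr * g                       # one step of "training"
    outer_loss = ((X_real @ w1 - y_real) ** 2).mean()
    opt.zero_grad()
    outer_loss.backward()                        # meta-gradient into the batch
    opt.step()
print(outer_loss.item())
```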

813Multi-Session Budget Optimization for Forward Auction-based Federated Learning

[openreview] [pdf]

Abstract Auction-based Federated Learning (AFL) has emerged as an important research field in recent years. The prevailing strategies for FL data consumers (DCs) assume that the entire team of the required data owners (DOs) for an FL task must be assembled before training can commence. In practice, a DC can trigger the FL training process multiple times. DOs can thus be gradually recruited over multiple FL model training sessions. Existing bidding strategies for AFL DCs are not designed to handle such scenarios. Therefore, the problem of multi-session AFL remains open. To address this problem, we propose the Multi-session Budget Optimization Strategy for forward Auction-based Federated Learning (MultiBOS-AFL). Based on hierarchical reinforcement learning, MultiBOS-AFL jointly optimizes inter-session budget pacing and intra-session bidding for AFL DCs, with the objective of maximizing the total utility. Extensive experiments on six benchmark datasets show that it significantly outperforms seven state-of-the-art approaches. On average, MultiBOS-AFL achieves 12.28% higher utility, 14.52% more data acquired through auctions for a given budget, and 1.23% higher test accuracy achieved by the resulting FL model compared to the best baseline. To the best of our knowledge, it is the first budget optimization decision support method with budget pacing capability designed for DCs in multi-session forward auction-based FL.

814What’s New in My Data? Novelty Exploration via Contrastive Generation

[openreview] [pdf]

Abstract Fine-tuning is widely used to adapt language models for specific goals, often leveraging real-world data such as patient records, customer-service interactions, or web content in languages not covered in pre-training. These datasets are typically massive, noisy, and often confidential, making their direct inspection challenging. However, understanding them is essential for guiding model deployment and informing decisions about data cleaning or suppressing any harmful behaviors learned during fine-tuning. In this study, we introduce the task of novelty discovery through generation, which aims to identify novel properties of a fine-tuning dataset by generating examples that illustrate these properties. Our approach - Contrastive Generative Exploration (CGE) - assumes no direct access to the data but instead relies on a pre-trained model and the same model after fine-tuning. By contrasting the predictions of these two models, CGE can generate examples that highlight novel characteristics of the fine-tuning data. However, this simple approach may produce examples that are too similar to one another, failing to capture the full range of novel phenomena present in the dataset. We address this by introducing an iterative version of CGE, where the previously generated examples are used to update the pre-trained model, and this updated model is then contrasted with the fully fine-tuned model to generate the next example, promoting diversity in the generated outputs. Our experiments demonstrate the effectiveness of CGE in detecting novel content, such as toxic language, as well as new natural and programming languages. Furthermore, we show that CGE remains effective even when models are fine-tuned using differential privacy techniques.
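
A minimal sketch of the contrastive decoding step behind CGE, assuming two Hugging Face-style causal LMs that expose `.logits`; the paper's exact sampling scheme and the iterative update of the pre-trained model are omitted.

```python
import torch

@torch.no_grad()
def cge_generate(ft_model, pre_model, ids, steps=50, temperature=1.0):
    """Sample tokens scored by how much more likely the fine-tuned model
    finds them than the pre-trained model does, surfacing content that is
    novel to the fine-tuning data."""
    for _ in range(steps):
        lp_ft = ft_model(ids).logits[:, -1].log_softmax(-1)
        lp_pre = pre_model(ids).logits[:, -1].log_softmax(-1)
        score = (lp_ft - lp_pre) / temperature   # contrastive novelty score
        nxt = torch.distributions.Categorical(logits=score).sample()
        ids = torch.cat([ids, nxt[:, None]], dim=-1)
    return ids
```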

815Rectified Diffusion: Straightness Is Not Your Need in Rectified Flow

[openreview] [pdf]

Abstract Diffusion models have greatly improved visual generation but are hindered by slow generation speed due to the computationally intensive nature of solving generative ODEs. Rectified flow, a widely recognized solution, improves generation speed by straightening the ODE path. Its key components include: 1) using the diffusion form of flow-matching, 2) employing $\boldsymbol{v}$-prediction, and 3) performing rectification (a.k.a. reflow). In this paper, we argue that the success of rectification primarily lies in using a pretrained diffusion model to obtain matched pairs of noise and samples, followed by retraining with these matched noise-sample pairs. Based on this, components 1) and 2) are unnecessary. Furthermore, we highlight that straightness is not an essential training target for rectification; rather, it is a specific case of flow-matching models. The more critical training target is to achieve a first-order approximate ODE path, which is inherently curved for models like DDPM and Sub-VP. Building on this insight, we propose Rectified Diffusion, which generalizes the design space and application scope of rectification to encompass the broader category of diffusion models, rather than being restricted to flow-matching models. We validate our method on Stable Diffusion v1-5 and Stable Diffusion XL. Our method not only greatly simplifies the training procedure of previous rectified-flow-based works (e.g., InstaFlow) but also achieves superior performance at even lower training cost.

816Exploiting Hidden Symmetry to Improve Objective Perturbation for DP linear learners with a nonsmooth ℓ1-norm

[openreview] [pdf]

Abstract Objective Perturbation (OP) is a classic approach to differentially private (DP) convex optimization with smooth loss functions but is less understood for nonsmooth cases. In this work, we study how to apply OP to DP linear learners under loss functions with an implicit $\ell_1$-norm structure, such as $\max(0,x)$ as a motivating example. We propose to first smooth out the hidden $\ell_1$-norm by convolution, and then invoke standard OP. Convolution has many advantages that distinguish it from the Moreau envelope, such as approximating from above and offering more degrees of freedom in hyperparameters. These advantages, in conjunction with the symmetry of the $\ell_1$-norm, result in tighter pointwise approximation, which further facilitates tighter analysis of generalization risks by using pointwise bounds. Under mild assumptions on ground-truth distributions, the proposed OP-based algorithm is found to be rate-optimal, and can achieve the excess generalization risk $\mathcal{O}\big(\frac{1}{\sqrt{n}}+\frac{\sqrt{d\ln(1/\delta)}}{n\varepsilon}\big)$. Experiments demonstrate the competitive performance of the proposed method compared to Noisy-SGD.
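
A small numeric check of the convolution-smoothing construction on the motivating example: convolving $\max(0,x)$ with a uniform kernel on $[-\beta,\beta]$ yields a smooth approximation from above with a simple closed form. This illustrates only the construction; the paper's kernel choice and DP analysis are more involved.

```python
import numpy as np

beta = 0.5

def smoothed_relu(x, beta=beta):
    # Closed form of (max(0, .) * uniform kernel)(x):
    # 0 for x <= -beta, x for x >= beta, (x + beta)^2 / (4*beta) in between.
    return np.where(x >= beta, x,
           np.where(x <= -beta, 0.0, (x + beta) ** 2 / (4 * beta)))

x = np.linspace(-2, 2, 9)
exact = np.maximum(0.0, x)
# Monte-Carlo convolution: E_t[max(0, x - t)], t ~ Uniform(-beta, beta)
t = np.random.default_rng(0).uniform(-beta, beta, size=100_000)
mc = np.maximum(0.0, x[:, None] - t[None, :]).mean(axis=1)
print(np.max(np.abs(mc - smoothed_relu(x))))   # small discretization error
print(np.all(smoothed_relu(x) >= exact))       # approximates from above
```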

817Scaling Laws for Diffusion Transformers

[openreview] [pdf]

Abstract Diffusion transformers (DiT) have already achieved appealing synthesis and scaling properties in content recreation, e.g., image and video generation. However, scaling laws of DiT are less explored, which usually offer precise predictions regarding optimal model size and data requirements given a specific compute budget. Therefore, experiments across a broad range of compute budgets, from 1e17 to 6e18 FLOPs, are conducted to confirm the existence of scaling laws in DiT for the first time. Concretely, the loss of pretraining DiT also follows a power-law relationship with the involved compute. Based on the scaling law, we can not only determine the optimal model size and required data but also accurately predict the text-to-image generation loss given a model with 1B parameters and a compute budget of 1e21 FLOPs. Additionally, we also demonstrate that the trend of pretraining loss matches the generation performance (e.g., FID), even across various datasets, which complements the mapping from compute to synthesis quality and thus provides a predictable benchmark that assesses model performance and data quality at a reduced cost.
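
To illustrate the kind of fit involved, here is a sketch of fitting a power law $L(C)=a\,C^{-b}+c$ to loss/compute pairs with SciPy. The functional form follows the abstract; the loss values are made up for illustration and are not taken from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(C, a, b, c):
    # Compute normalized to 1e17 FLOPs for numerical stability.
    return a * (C / 1e17) ** (-b) + c

C = np.array([1e17, 3e17, 1e18, 3e18, 6e18])
L = np.array([0.62, 0.55, 0.49, 0.45, 0.43])        # hypothetical losses

(a, b, c), _ = curve_fit(power_law, C, L, p0=(0.3, 0.3, 0.3))
print(f"L(C) = {a:.3g} * (C/1e17)^(-{b:.3g}) + {c:.3g}")
print("extrapolated loss at 1e21 FLOPs:", power_law(1e21, a, b, c))
```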

818Diffusion Transformers for Tabular Data Time Series Generation

[openreview] [pdf]

Abstract Tabular data generation has recently attracted growing interest due to its different application scenarios. However, generating time series of tabular data, where each element of the series depends on the others, remains a largely unexplored domain. This gap is probably due to the difficulty of jointly solving different problems, chief among them the heterogeneity of tabular data (a problem common to non-time-dependent approaches) and the variable length of a time series. In this paper, we propose a Diffusion Transformer (DiT)-based approach for tabular data series generation. Inspired by the recent success of DiTs in image and video generation, we extend this framework to deal with heterogeneous data and variable-length sequences. Using extensive experiments on six datasets, we show that the proposed approach outperforms previous work by a large margin. Our code will be made public after this article is accepted.

819Spatiotemporal Backward Inconsistency Learning Gives STGNNs Icing on the Cake

[openreview] [pdf]

Abstract Spatiotemporal prediction models facilitate smart-city applications across various domains, such as traffic and climate. While current advancements in these models emphasize leveraging cutting-edge technologies to enhance spatiotemporal learning, they often operate under the implicit assumption of spatiotemporal feature consistency between inputs and labels, overlooking the critical issue of input-label inconsistency. In this study, we introduce a universal spatiotemporal backward inconsistency learning module capable of seamless integration into a variety of models, offering a notable performance boost by explicitly modeling label features to address input-label inconsistency. Our approach includes the development of a spatiotemporal residual theory, advocating for holistic spatiotemporal learning that encompasses both forward spatiotemporal learning, which captures the input data’s spatiotemporal features to generate base predictions, akin to existing STNNs, and a backward process to learn residuals that rectify input-label inconsistency, thereby refining the base predictions. Based on this theory, we design the Spatio-Temporal Backward Inconsistency Learning Module (STBIM) for this backward correction process, comprising a residual learning module for decoupling inconsistency information from input representations and label representations, and a residual propagation module for smoothing residual terms to facilitate stable learning. The generated prediction correction term is used to enhance the prediction accuracy. Experimental results on 11 datasets from the traffic and atmospheric domains, combined with 15 spatiotemporal prediction models, demonstrate the broad positive impact of the proposed STBIM. The code is available at https://anonymous.4open.science/r/ICLR2025-2598.

820AutoRegressive Knowledge Base Completion

[openreview] [pdf]

Abstract Despite their large sizes, many Knowledge Graphs (KGs) remain highly incomplete. This problem has motivated numerous approaches to complete the KGs by embedding them in a latent space to find the missing links. Although these methods show promising performance, a general limitation is that the scores given to possible links are uncalibrated and cannot be interpreted across different queries. Hence, we say they are local, as they relate to a specific context. This limitation makes it non-trivial to deduce the truth value of the links and to answer complex queries. Another limitation is that their learning depends on negative sampling, which is challenging due to the Open World Assumption (OWA). To solve this problem, we propose a novel auto-regressive generative model that learns a joint distribution of the entities and relations of the KG without resorting to negative sampling. This distribution can be used to infer the probability that a link is sampled from the KG, which allows us to return a global score that is interpretable in different contexts. Moreover, our method has the additional advantage that it offers probabilistic semantics for complex reasoning and knowledge base completion, achieving state-of-the-art performance on link prediction with consistent scores across the entire KG.
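
A minimal sketch of the autoregressive factorization over triples, $p(h,r,t)=p(h)\,p(r\mid h)\,p(t\mid h,r)$, with a tiny GRU standing in for the paper's architecture; training would maximize this joint over observed triples only (no negative sampling), so the resulting scores are globally comparable.

```python
import torch

n_entities, n_relations, dim = 100, 10, 32
vocab = n_entities + n_relations                 # entities and relations share a vocab
emb = torch.nn.Embedding(vocab, dim)
rnn = torch.nn.GRU(dim, dim, batch_first=True)
head = torch.nn.Linear(dim, vocab)
h0_logits = torch.nn.Parameter(torch.zeros(vocab))   # prior over the first token

def triple_log_prob(h, r, t):
    seq = torch.tensor([[h, n_entities + r, t]])
    out, _ = rnn(emb(seq))
    logp = h0_logits.log_softmax(-1)[seq[0, 0]]          # log p(h)
    logp += head(out[0, 0]).log_softmax(-1)[seq[0, 1]]   # log p(r | h)
    logp += head(out[0, 1]).log_softmax(-1)[seq[0, 2]]   # log p(t | h, r)
    return logp

print(triple_log_prob(3, 2, 57))
```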

821Text-to-Model: Text-Conditioned Neural Network Diffusion for Train-Once-for-All Personalization

[openreview] [pdf]

Abstract Generative artificial intelligence (GenAI) has made significant progress in understanding world knowledge and generating content from human languages across various modalities, like text-to-text large language models, text-to-image stable diffusion, and text-to-video Sora. In this paper, we investigate the capability of GenAI for text-to-model generation, to see whether GenAI can comprehend hyper-level knowledge embedded within AI models’ own parameters. Specifically, we study a practical scenario termed train-once-for-all personalization, aiming to generate personalized models for diverse end-users and tasks using text prompts. Inspired by the recent emergence of neural network diffusion, we present Tina, a text-conditioned neural network diffusion for train-once-for-all personalization. Tina leverages a diffusion transformer model conditioned on task descriptions embedded using a CLIP model. Despite the astronomical number of potential personalized tasks (e.g., $1.73\times10^{13}$), by our design, Tina demonstrates remarkable in-distribution and out-of-distribution generalization even when trained on small datasets ($\sim 1000$ samples). We further verify whether and how Tina understands world knowledge by analyzing its capabilities under zero-shot/few-shot image prompts, different numbers of personalized classes, prompts of natural language descriptions, and predicting unseen entities.

822State Space Models are Provably Comparable to Transformers in Dynamic Token Selection

[openreview] [pdf]

Abstract Deep neural networks based on state space models (SSMs) are attracting significant attention in sequence modeling since their computational cost is significantly smaller than that of Transformers. While the capabilities of SSMs have been demonstrated through experiments in various tasks, theoretical understanding of SSMs is still limited. In particular, most theoretical studies discuss the capabilities of SSM layers without nonlinear layers, and there is a lack of discussion on their combination with nonlinear layers. In this paper, we explore the capabilities of SSMs combined with fully connected neural networks, and show that they are comparable to Transformers in extracting the essential tokens depending on the input. As concrete examples, we consider two synthetic tasks, which are challenging for a single SSM layer, and demonstrate that SSMs combined with nonlinear layers can efficiently solve these tasks. Furthermore, we study the nonparametric regression task, and prove that the ability of SSMs is equivalent to that of Transformers in estimating functions belonging to a certain class.

823Improving Diffusion-based Data Augmentation with Inversion Circle Interpolation

[openreview] [pdf]

Abstract Data Augmentation (DA), i.e., synthesizing faithful and diverse samples to expand the original training set, is a prevalent and effective strategy to improve various visual recognition tasks. With the powerful image generation ability, diffusion-based DA has shown strong performance gains on different benchmarks. In this paper, we analyze today’s diffusion-based DA methods, and argue that they cannot take into account both faithfulness and diversity, which are two critical keys for generating high-quality samples and boosting final classification performance. To this end, we propose a novel Diffusion-based Inversion Interpolation DA method: Diff-II. Specifically, Diff-II consists of three main steps: 1) Category concepts learning: Learning concept embeddings for each category. 2) Inversion interpolation: Calculating the inversion for each image, and conducting random circle interpolation for two randomly sampled inversions from the same category. 3) Two-stage denoising: Using different prompts to generate synthesized images in a coarse-to-fine manner. Extensive experiments on multiple image classification tasks (e.g., few-shot, long-tailed, and out-of-distribution classification) have demonstrated its effectiveness over state-of-the-art diffusion-based DA methods.

824Single Teacher, Multiple Perspectives: Teacher Knowledge Augmentation for Enhanced Knowledge Distillation

[openreview] [pdf]

Abstract Do diverse perspectives help students learn better? Multi-teacher knowledge distillation, which is a more effective technique than traditional single-teacher methods, supervises the student from different perspectives (i.e., teachers). While effective, multi-teacher, teacher-ensemble, or teaching-assistant-based approaches are computationally expensive and resource-intensive, as they require training multiple teacher networks. These concerns raise a question: can we supervise the student with diverse perspectives using only a single teacher? We pioneer TeKAP (Teacher Knowledge Augmentation via Perturbation), a novel technique that generates multiple synthetic teacher knowledge signals by perturbing the knowledge of a single pretrained teacher at both the feature and logit levels. Together, these augmented teachers simulate an ensemble of models. The student model is trained on both the actual and augmented teacher knowledge, benefiting from the diversity of an ensemble without the need to train multiple teachers. TeKAP significantly reduces training time and computational resources, making it feasible for large-scale applications and easy to manage. Experimental results demonstrate that our proposed method helps existing state-of-the-art knowledge distillation techniques achieve better performance, highlighting its potential as a cost-effective alternative. The source code can be found in the supplementary.
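
A minimal sketch of the logit-level variant: synthetic teachers are created by Gaussian perturbation of one teacher's logits, and a standard temperature-scaled KD loss is averaged over the real and perturbed views. The noise scale here is an illustrative assumption, and the feature-level counterpart is omitted.

```python
import torch
import torch.nn.functional as F

def tekap_kd_loss(student_logits, teacher_logits, n_aug=3, sigma=0.5, T=4.0):
    # Real teacher plus n_aug perturbed "synthetic teachers".
    views = [teacher_logits] + [
        teacher_logits + sigma * torch.randn_like(teacher_logits)
        for _ in range(n_aug)
    ]
    loss = 0.0
    for v in views:
        loss += F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(v / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
    return loss / len(views)

s, t = torch.randn(8, 10, requires_grad=True), torch.randn(8, 10)
tekap_kd_loss(s, t).backward()
```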

825Reset Method based on the Theory of Manifold Optimization on Real Manifolds

[openreview] [pdf]

Abstract Manifold optimization is prominent in the fields of applied mathematics, statistics, machine learning, and, in particular, deep learning. By leveraging the intrinsic geometric properties of manifolds, constrained optimization problems can be transformed into unconstrained optimization problems on certain manifolds. We introduce an innovative method, the Reset Method, which combines manifold optimization with standard optimizers (SGD, Adam, and AdamW), aiming to improve precision. The efficacy of our proposed method is corroborated by extensive deep learning experiments, which yield visibly higher precision.

826Beyond Finite Data: Towards Data-free Out-of-distribution Generalization via Extrapolation

[openreview] [pdf]

Abstract Out-of-distribution (OOD) generalization is a favorable yet challenging property for deep neural networks. The core challenges lie in the limited availability of source domains that help models learn an invariant representation from the spurious features. Various domain augmentation methods have been proposed, but they largely rely on interpolating existing domains and frequently face difficulties in creating truly “novel” domains. Humans, on the other hand, can easily extrapolate novel domains; thus, an intriguing question arises: How can neural networks extrapolate like humans and achieve OOD generalization? We introduce a novel approach to domain extrapolation that leverages reasoning ability and the extensive knowledge encapsulated within large language models (LLMs) to synthesize entirely new domains. Starting with the class of interest, we query the LLMs to extract relevant knowledge for these novel domains. We then bridge the gap between the text-centric knowledge derived from LLMs and the pixel input space of the model using text-to-image generation techniques. By augmenting the training set of domain generalization datasets with high-fidelity, photo-realistic images of these new domains, we achieve significant improvements over all existing methods, as demonstrated in both single and multi-domain generalization across various benchmarks. With the ability to extrapolate any domain for any class, our method has the potential to learn a generalized model for any task without any data. To illustrate, we put forth a much more difficult setting, termed data-free domain generalization, that aims to learn a generalized model in the absence of any collected data. Our empirical findings support the above argument, and our method exhibits commendable performance in this setting, even surpassing the supervised setting by approximately 1-2% on datasets such as VLCS.

827Direct Advantage Estimation in Partially Observable Environments

[openreview] [pdf]

Abstract Direct Advantage Estimation (DAE) was recently shown to improve the sample-efficiency of deep reinforcement learning algorithms. However, DAE assumes full observability of the environment, which may be restrictive in realistic settings. In the present work, we first show that DAE can be extended to partially observable domains with minor modifications. Secondly, we address the increased computational cost due to the need to approximate the transition probabilities through the use of discrete latent dynamics models. Finally, we empirically evaluate the proposed method using the Arcade Learning Environment, and show that it is scalable and sample-efficient.

828Client2Vec: Improving Federated Learning by Distribution Shifts Aware Client Indexing

[openreview] [pdf]

Abstract Federated Learning (FL) is a privacy-preserving distributed machine learning paradigm. Nonetheless, the substantial distribution shifts among clients pose a considerable challenge to the performance of current FL algorithms. To mitigate this challenge, various methods have been proposed to enhance the FL training process. This paper endeavors to tackle the issue of data heterogeneity from another perspective---by improving FL algorithms prior to the actual training stage. Specifically, we introduce the Client2Vec mechanism, which generates a unique client index for each client before the commencement of FL training. Subsequently, we leverage the generated client index to enhance the subsequent FL training process. To demonstrate the effectiveness of the proposed Client2Vec method, we conduct three case studies that assess the impact of the client index on the FL training process. These case studies encompass enhanced client sampling, model aggregation, and local training. Extensive experiments conducted on diverse datasets and model architectures show the efficacy of Client2Vec across all three case studies. Our code will be publicly available.

829Model-based RL as a Minimalist Approach to Horizon-Free and Second-Order Bounds

[openreview] [pdf]

Abstract Learning a transition model via Maximum Likelihood Estimation (MLE) followed by planning inside the learned model is perhaps the most standard and simplest Model-based Reinforcement Learning (RL) framework. In this work, we show that such a simple Model-based RL scheme, when equipped with optimistic and pessimistic planning procedures, achieves strong regret and sample complexity bounds in online and offline RL settings. Particularly, we demonstrate that under the conditions where the trajectory-wise reward is normalized between zero and one and the transition is time-homogeneous, it achieves nearly horizon-free and second-order bounds. Nearly horizon-free means that our bounds have no polynomial dependence on the horizon of the Markov Decision Process. A second-order bound is a type of instance-dependent bound that scales with respect to the variances of the returns of the policies which can be small when the system is nearly deterministic and (or) the optimal policy has small values. We highlight that our algorithms are simple, fairly standard, and indeed have been extensively studied in the RL literature: they learn a model via MLE, build a version space around the MLE solution, and perform optimistic or pessimistic planning depending on whether operating in the online or offline mode. These algorithms do not rely on additional specialized algorithmic designs such as learning variances and performing variance-weighted learning and thus can easily leverage non-linear function approximations. The simplicity of the algorithms also implies that our horizon-free and second-order regret analysis is actually standard and mainly follows the general framework of optimism/pessimism in the face of uncertainty.

830A Hypothesis on Black Swan in Unchanging Environments

[openreview] [pdf]

Abstract Black swan events are statistically rare occurrences that carry extremely high risks. The typical view holds that black swan events originate from unpredictable, time-varying environments; however, the community lacks a comprehensive definition of black swan events. To this end, this paper argues that the standard view is incomplete and claims that high-risk, statistically rare events can also occur in unchanging environments due to human misperception of their value and likelihood, which we call spatial black swan events. We first carefully categorize black swan events, focusing on spatial black swan events, and mathematically formalize the definition of black swan events. We hope these definitions can pave the way for the development of algorithms to prevent such events by rationally correcting human perception.

831Exploratory Preference Optimization: Provably Sample-Efficient Exploration in RLHF with General Function Approximation

[openreview] [pdf]

Abstract This paper investigates a basic question in reinforcement learning from human feedback (RLHF) from a theoretical perspective: how to efficiently explore in an online manner under preference feedback and general function approximation. We take the initial step towards a theoretical understanding of this problem by proposing a novel algorithm, Exploratory Preference Optimization (XPO). This algorithm is elegantly simple---requiring only a one-line modification to (online) Direct Preference Optimization (DPO; Rafailov et al., 2023)---yet provides the strongest known provable guarantees. XPO augments the DPO objective with a novel and principled exploration bonus, enabling the algorithm to strategically explore beyond the support of the initial model and preference feedback data. We prove that XPO is provably sample-efficient and converges to a near-optimal policy under natural exploration conditions, regardless of the initial model’s coverage. Our analysis builds on the observation that DPO implicitly performs a form of Bellman error minimization. It synthesizes previously disparate techniques from language modeling and theoretical reinforcement learning in a serendipitous fashion through the lens of KL-regularized Markov decision processes.
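
A schematic of the objective shape: the standard DPO loss plus an optimism bonus on exploratory responses. The DPO term below is standard; the bonus is a simplified stand-in for the paper's exact one-line modification, with hypothetical coefficient names.

```python
import torch
import torch.nn.functional as F

def xpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, logp_expl,
             beta=0.1, alpha=1e-3):
    # logp_* are summed log-probs of chosen (w) / rejected (l) responses
    # under the policy; ref_logp_* under the reference model; logp_expl is
    # the policy log-prob of a freshly sampled exploratory response.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo = -F.logsigmoid(margin).mean()           # standard DPO term
    bonus = alpha * logp_expl.mean()             # optimism toward novel responses
    return dpo - bonus

lw, ll = torch.randn(4, requires_grad=True), torch.randn(4)
print(xpo_loss(lw, ll, torch.randn(4), torch.randn(4), torch.randn(4)))
```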

832Transformers Learn Bayesian Networks Autoregressively In-Context

[openreview] [pdf]

Abstract Transformers have achieved tremendous successes in various fields, notably excelling in tasks involving sequential data like natural language processing. Despite their achievements, there is limited understanding of the theoretical capabilities of transformers. In this paper, we theoretically investigate the capability of transformers to autoregressively learn Bayesian networks in-context. Specifically, we consider a setting where a set of independent samples generated from a Bayesian network are observed and form a context. We show that there exists a simple transformer model that can (i) estimate the conditional probabilities of the Bayesian network according to the context, and (ii) autoregressively generate a new sample according to the Bayesian network with estimated conditional probabilities. We further demonstrate in extensive experiments that such a transformer does not only exist in theory, but can also be effectively obtained through training. Our analysis showcases the potential of transformers to effectively learn complicated probabilistic models, and contributes to a better understanding of the success of large language models.

833Evaluating Ranking Loss Functions in Performance Predictor for NAS

[openreview] [pdf]

Abstract Performance evaluation is a critical but compute-intensive procedure in neural architecture search (NAS). To alleviate evaluation costs, performance predictors have been widely adopted to predict architecture performance directly. Recent studies have introduced ranking loss functions into predictors to focus on architecture rankings instead of absolute accuracy, thus enhancing the ranking ability of performance predictors. Despite the successful application of ranking loss functions, the lack of comprehensive evaluation metrics and differing experimental configurations make a fair comparison among these loss functions a huge challenge. Additionally, some well-known ranking loss functions have not been thoroughly examined in the context of performance predictors. In this paper, we conduct the first study of 11 ranking loss functions, covering both existing and novel ones, by comparing their effectiveness in performance predictors under various settings. We find that: (i) the choice of ranking loss function has a major influence on the performance of predictors; (ii) the quality of the architectures searched by predictor-based NAS methods is closely correlated with the predictor’s performance on top-centered rank metrics, rather than traditional metrics like Kendall Tau. We believe these results and insights can serve as recommendations for the optimal loss function to employ in predictors across various search spaces and experimental conditions.

834Distributional Associations vs In-Context Reasoning: A Study of Feed-forward and Attention Layers

[openreview] [pdf]

Abstract Large language models have been successful at tasks involving basic forms of in-context reasoning, such as generating coherent language, as well as storing vast amounts of knowledge. At the core of the Transformer architecture behind such models are feed-forward and attention layers, which are often associated with knowledge and reasoning, respectively. In this paper, we study this distinction empirically and theoretically in a controlled synthetic setting where certain next-token predictions involve both distributional and in-context information. We find that feed-forward layers tend to learn simple distributional associations such as bigrams, while attention layers focus on in-context reasoning. Our theoretical analysis identifies gradient noise as a key factor behind this discrepancy. Finally, we illustrate how similar disparities emerge in pre-trained models through ablations on the Pythia model family on simple reasoning tasks.

835Learning Randomized Algorithms with Transformers

[openreview] [pdf]

Abstract Randomization is a powerful tool that endows algorithms with remarkable properties. For instance, randomized algorithms excel in adversarial settings, often surpassing the worst-case performance of deterministic algorithms with large margins. Furthermore, their success probability can be amplified by simple strategies such as repetition and majority voting. In this paper, we enhance deep neural networks, in particular transformer models, with randomization. We demonstrate for the first time that randomized algorithms can be instilled in transformers through learning, in a purely data- and objective-driven manner. First, we analyze known adversarial objectives for which randomized algorithms offer a distinct advantage over deterministic ones. We then show that common optimization techniques, such as gradient descent or evolutionary strategies, can effectively learn transformer parameters that make use of the randomness provided to the model. To illustrate the broad applicability of randomization in empowering neural networks, we study three conceptual tasks: associative recall, graph coloring, and agents that explore grid worlds. In addition to demonstrating increased robustness against oblivious adversaries through learned randomization, our experiments reveal remarkable performance improvements due to the inherently random nature of the neural networks’ computation and predictions.

836Overcoming label shift in targeted federated learning

[openreview] [pdf]

Abstract Federated learning enables multiple actors to collaboratively train models without sharing private data. This unlocks the potential for scaling machine learning to diverse applications. Existing algorithms for this task are well-justified when clients and the intended target domain share the same distribution of features and labels, but this assumption is often violated in real-world scenarios. One common violation is label shift, where the label distributions differ across clients or between clients and the target domain, which can significantly degrade model performance. To address this problem, we propose FedPALS, a novel model aggregation scheme that adapts to label shifts by leveraging knowledge of the target label distribution at the central server. Our approach ensures unbiased updates under stochastic gradient descent, ensuring robust generalization across clients with diverse, label-shifted data. Extensive experiments on image classification demonstrate that FedPALS consistently outperforms standard baselines by aligning model aggregation with the target domain. Our findings reveal that conventional federated learning methods suffer severely in cases of extreme client sparsity, highlighting the critical need for target-aware aggregation. FedPALS offers a principled and practical solution to mitigate label distribution mismatch, ensuring models trained in federated settings can generalize effectively to label-shifted target domains.
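
One way to realize target-aware aggregation, sketched under the assumption that the server knows the client and target label distributions: fit simplex weights so the mixture of client label distributions matches the target, then aggregate client updates with those weights. FedPALS involves further considerations beyond this plain projected-gradient fit.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0)

def label_shift_weights(client_label_dists, target_dist, iters=500, lr=0.1):
    P = np.asarray(client_label_dists)          # (n_clients, n_classes)
    lam = np.full(len(P), 1.0 / len(P))
    for _ in range(iters):
        grad = 2 * P @ (lam @ P - target_dist)  # d/d(lam) ||P^T lam - t||^2
        lam = project_simplex(lam - lr * grad)
    return lam

P = [[0.8, 0.2], [0.1, 0.9], [0.5, 0.5]]
lam = label_shift_weights(P, np.array([0.3, 0.7]))
print(lam, lam @ np.asarray(P))                 # mixture approx equals target
# Server aggregate: sum_i lam[i] * client_update[i]
```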

837Differentiable Reasoning about Knowledge Graphs with Reshuffled Embeddings

[openreview] [pdf]

Abstract Knowledge graph (KG) embedding methods learn geometric representations of entities and relations to predict plausible missing knowledge. These representations are typically assumed to capture rule-like inference patterns. However, our theoretical understanding of the kinds of inference patterns that can be captured in this way remains limited. Ideally, KG embedding methods should be expressive enough such that for any set of rules, there exists an embedding that exactly captures these rules. This principle has been studied within the framework of region-based embeddings, but existing models are severely limited in the kinds of rule bases that can be captured. We argue that this stems from the use of representations that correspond to the Cartesian product of two-dimensional regions. As an alternative, we propose RESHUFFLE, a simple model based on ordering constraints that can faithfully capture a much larger class of rule bases than existing approaches. Moreover, the embeddings in our framework can be learned by a Graph Neural Network (GNN), which effectively acts as a differentiable rule base. This has some practical advantages, e.g. ensuring that embeddings can be easily updated as new knowledge is added to the KG. At the same time, since the resulting representations can be used similarly to standard KG embeddings, our approach is significantly more efficient than existing approaches to differentiable reasoning. The GNN-based formulation also allows us to study how bounded inference can be captured. We show in particular that bounded reasoning with arbitrary sets of closed path rules can be captured in this way.

838Beyond Auto-Regression: Fast LLMs via Self-Distillation Through Time

[openreview] [pdf]

Abstract Autoregressive (AR) Large Language Models (LLMs) have demonstrated significant success across numerous tasks. However, the AR modeling paradigm presents certain limitations; for instance, contemporary autoregressive LLMs are trained to generate one token at a time, which can result in noticeable latency. Recent advancements have indicated that search and repeated sampling can enhance performance in various applications, such as theorem proving, code generation, and alignment, by utilizing greater computational resources during inference. In this study, we demonstrate that diffusion language models are capable of generating at least 32 tokens simultaneously, while exceeding the performance of AR models in text quality and on the LAMBADA natural language understanding benchmark. This outcome is achieved through a novel distillation method for discrete diffusion models, which reduces the number of inference steps by a factor of 32-64. Practically, our models, even without caching, can generate tokens at a rate that is up to 8 times faster than AR models employing KV-caching, and we anticipate further improvements with the inclusion of caching. Moreover, we demonstrate the efficacy of our approach for diffusion language models with up to 860M parameters.

839Scalable Decentralized Learning with Teleportation

[openreview] [pdf]

Abstract Decentralized SGD can run with low communication costs, but its sparse communication characteristics deteriorate the convergence rate, especially when the number of nodes is large. In decentralized learning settings, communication is assumed to occur on only a given topology, while in many practical cases, the topology merely represents a preferred communication pattern, and connecting to arbitrary nodes is still possible. Previous studies have tried to alleviate the convergence rate degradation in these cases by designing topologies with large spectral gaps. However, the degradation is still significant when the number of nodes is substantial. In this work, we propose TELEPORTATION. TELEPORTATION activates only a subset of nodes, and the active nodes fetch the parameters from previous active nodes. Then, the active nodes update their parameters by SGD and perform gossip averaging on a relatively small topology comprising only the active nodes. We show that by activating only a proper number of nodes, TELEPORTATION can completely alleviate the convergence rate degradation. Furthermore, we propose an efficient hyperparameter-tuning method to search for the appropriate number of nodes to be activated. Experimentally, we show that TELEPORTATION can train neural networks more stably and achieve higher accuracy than Decentralized SGD.

840Learning Task Belief Similarity with Latent Dynamics for Meta-Reinforcement Learning

[openreview] [pdf]

Abstract Meta-reinforcement learning requires utilizing prior task distribution information obtained during exploration to rapidly adapt to unknown tasks. The efficiency of an agent’s exploration hinges on accurately identifying the current task. Recent Bayes-Adaptive Deep RL approaches often rely on reconstructing the environment’s reward signal, which is challenging in sparse reward settings, leading to suboptimal exploitation. Inspired by bisimulation metrics, which robustly extract behavioral similarity in continuous MDPs, we propose SimBelief, a novel meta-RL framework that measures task belief similarity in the Bayes-Adaptive MDP (BAMDP). SimBelief effectively extracts common features of similar task distributions, enabling efficient task identification and exploration in sparse reward environments. We introduce a latent task belief metric to learn the common structure of similar tasks and incorporate it into the real task belief. By learning the latent dynamics across task distributions, we connect shared latent task belief features with specific task features, facilitating rapid task identification and adaptation. Our method outperforms state-of-the-art baselines on sparse reward MuJoCo and panda-gym tasks.

841Leveraging Additional Information in POMDPs with Guided Policy Optimization

[openreview] [pdf]

Abstract Reinforcement Learning (RL) in partially observable environments poses significant challenges due to the complexity of learning under uncertainty. While additional information, such as that available in simulations, can enhance training, effectively leveraging it remains an open problem. To address this, we introduce Guided Policy Optimization (GPO), a framework that co-trains a guider and a learner. The guider takes advantage of supplementary information while ensuring alignment with the learner’s policy, which is primarily trained via Imitation Learning (IL). We theoretically demonstrate that this learning scheme achieves optimality comparable to direct RL, thereby overcoming key limitations inherent in IL approaches. Our approach includes two practical variants, GPO-penalty and GPO-clip, and empirical evaluations show strong performance across various tasks, including continuous control with partial observability and noise, and memory-based challenges, significantly outperforming existing methods.

842Causal-aware Graph Neural Architecture Search under Distribution Shifts

[openreview] [pdf]

Abstract Graph neural architecture search (Graph NAS) has emerged as a promising approach for autonomously designing graph neural network architectures by leveraging the correlations between graphs and architectures. However, the existing methods fail to generalize under distribution shifts that are ubiquitous in real-world graph scenarios, mainly because the graph-architecture correlations they exploit might be spurious and varying across distributions. In this paper, we propose to handle the distribution shifts in the graph architecture search process by discovering and exploiting the causal relationship between graphs and architectures to search for the optimal architectures that can generalize under distribution shifts. The problem remains unexplored with the following critical challenges: 1) how to discover the causal graph-architecture relationship that has stable predictive abilities across distributions, 2) how to handle distribution shifts with the discovered causal graph-architecture relationship to search the generalized graph architectures. To address these challenges, we propose a novel approach, Causal-aware Graph Neural Architecture Search (CARNAS), which is able to capture the causal graph-architecture relationship during the architecture search process and discover the generalized graph architecture under distribution shifts. Specifically, we propose Disentangled Causal Subgraph Identification to capture the causal subgraphs that have stable prediction abilities across distributions. Then, we propose Graph Embedding Intervention to intervene on causal subgraphs within the latent space, ensuring that these subgraphs encapsulate essential features for prediction while excluding non-causal elements. Additionally, we propose Invariant Architecture Customization to reinforce the causal invariant nature of the causal subgraphs, which are utilized to tailor generalized graph architectures. Extensive experiments on synthetic and real-world datasets demonstrate that our proposed CARNAS achieves advanced out-of-distribution generalization ability by discovering the causal relationship between graphs and architectures during the search process.

843Rapidly Adapting Policies to the Real-World via Simulation-Guided Fine-Tuning

[openreview] [pdf]

Abstract Robot learning requires a considerable amount of data to realize the promise of generalization. However, it can be challenging to actually collect the magnitude of high-quality data necessary for generalization entirely in the real world. Simulation can serve as a source of plentiful data, wherein techniques such as reinforcement learning can obtain broad coverage over states and actions. However, high-fidelity physics simulators are fundamentally misspecified approximations to reality, making direct zero-shot transfer challenging, especially in tasks where precise and forceful manipulation is necessary. This makes real-world fine-tuning of policies pretrained in simulation an attractive approach to robot learning. However, exploring the real-world dynamics with standard RL fine-tuning techniques is too inefficient for many real-world applications. This paper introduces Simulation-Guided Fine-Tuning, a general framework which leverages the structure of the simulator to guide exploration, substantially accelerating adaptation to the real world. We demonstrate our approach across several manipulation tasks in the real world, learning successful policies for problems that are challenging to learn using purely real-world data. We further provide theoretical backing for the paradigm.

844HelpSteer2-Preference: Complementing Ratings with Preferences

[openreview] [pdf]

Abstract Reward models are critical for aligning models to follow instructions, and are typically trained following one of two popular paradigms: Bradley-Terry style or Regression style. However, there is a lack of evidence that either approach is better than the other, when adequately matched for data. This is primarily because these approaches require data collected in different (but incompatible) formats, meaning that adequately matched data is not available in existing public datasets. To tackle this problem, we release preference annotations (designed for Bradley-Terry training) to complement existing ratings (designed for Regression style training) in the HelpSteer2 dataset. To improve data interpretability, preference annotations are accompanied with human-written justifications. Using this data, we conduct the first head-to-head comparison of Bradley-Terry and Regression models when adequately matched for data. Based on insights derived from such a comparison, we propose a novel approach to combine Bradley-Terry and Regression reward modeling. A Llama-3.1-70B-Instruct model tuned with this approach scores 94.1 on RewardBench, emerging top of more than 140 reward models as of 1 Oct 2024. We also demonstrate the effectiveness of this reward model at aligning models to follow instructions in RLHF. We open-source this dataset (CC-BY-4.0 license) and openly release the trained Reward Model.
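
For concreteness, the two paradigms being compared reduce to two losses over scalar rewards from the same backbone: Bradley-Terry trains on preference pairs, regression trains on absolute helpfulness ratings. A minimal sketch (the paper's combined approach is not reproduced here):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen, r_rejected):
    # Maximize the log-likelihood that the chosen response wins the pair.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def regression_loss(r_pred, rating):
    # Fit the scalar reward directly to a human rating.
    return F.mse_loss(r_pred, rating)

rc, rr = torch.randn(8, requires_grad=True), torch.randn(8)
print(bradley_terry_loss(rc, rr))
print(regression_loss(rc, torch.rand(8) * 4))   # e.g., a 0-4 helpfulness scale
```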

845DiNO-Diffusion: Scaling Medical Diffusion Models via Self-Supervised Pre-Training

[openreview] [pdf]

Abstract Diffusion models (DMs) require large annotated datasets for training, limiting their applicability in medical imaging where datasets are typically smaller and sparsely annotated. We introduce DiNO-Diffusion, a self-supervised method for training DMs that conditions the generation process on image embeddings extracted from DiNO, a pretrained vision transformer. By not relying on annotations, our training leverages over 868k unlabelled images from public chest X-Ray (CXR) datasets. DiNO-Diffusion shows comprehensive manifold coverage, with FID scores as low as 4.7, and emerging properties when evaluated in downstream tasks, allowing to generate semantically-diverse synthetic datasets even from small data pools, demonstrating up to 20% AUC increase in classification performance when used for data augmentation. Results suggest that DiNO-Diffusion could facilitate the creation of large datasets for flexible training of downstream AI models from limited amount of real data, while also holding potential for privacy preservation. Additionally, DiNO-Diffusion demonstrates zero-shot segmentation performance of up to 84.4% Dice score when evaluating lung lobe segmentation, evidencing good CXR image-anatomy alignment akin to textual descriptors on vanilla DMs. Finally, DiNO-Diffusion can be easily adapted to other medical imaging modalities or state-of-the-art diffusion models, allowing large-scale, multi-domain image generation pipelines for medical imaging.

846Federated Q-Learning with Reference-Advantage Decomposition: Almost Optimal Regret and Logarithmic Communication Cost

[openreview] [pdf]

Abstract In this paper, we consider model-free federated reinforcement learning for tabular episodic Markov decision processes. Under the coordination of a central server, multiple agents collaboratively explore the environment and learn an optimal policy without sharing their raw data. Despite recent advances in federated $Q$-learning algorithms achieving near-linear regret speedup with low communication cost, existing algorithms only attain suboptimal regrets compared to the information bound. We propose a novel model-free federated $Q$-learning algorithm, termed FedQ-Advantage. Our algorithm leverages reference-advantage decomposition for variance reduction and adopts three novel designs: separate event-triggered communication and policy switching, heterogeneous communication triggering conditions, and optional forced synchronization. We prove that our algorithm not only requires a lower logarithmic communication cost but also achieves an almost optimal regret, reaching the information bound up to a logarithmic factor and near-linear regret speedup compared to its single-agent counterpart when the time horizon is sufficiently large.

847Learn With Imagination: Safe Set Guided State-wise Constrained Policy Optimization

[openreview] [pdf]

Abstract Deep reinforcement learning (RL) excels in various control tasks, yet the absence of safety guarantees hampers its real-world applicability. In particular, explorations during learning usually result in safety violations, while the RL agent learns from those mistakes. On the other hand, safe control techniques ensure persistent safety satisfaction but demand strong priors on system dynamics, which is usually hard to obtain in practice. To address these problems, we present Safe Set Guided State-wise Constrained Policy Optimization (S-3PO), a pioneering algorithm generating state-wise safe optimal policies with zero training violations, i.e., learning without mistakes. S-3PO first employs a safety-oriented monitor with black-box dynamics to ensure safe exploration. It then enforces a unique cost for the RL agent to converge to optimal behaviors within safety constraints. S-3PO outperforms existing methods in high-dimensional robotics tasks, managing state-wise constraints with zero training violation. This innovation marks a significant stride towards real-world safe RL deployment.

848Universal generalization guarantees for Wasserstein distributionally robust models

[openreview] [pdf]

Abstract Distributionally robust optimization has emerged as an attractive way to train robust machine learning models, capturing data uncertainty and distribution shifts. Recent statistical analyses have proved that robust models based on the Wasserstein distance have generalization guarantees that do not suffer from the curse of dimensionality. However, these results are either approximate, obtained in specific cases, or based on assumptions difficult to verify in practice. In contrast, we establish exact generalization guarantees that cover a wide range of cases, with arbitrary transport costs and parametric loss functions, including deep learning objectives with nonsmooth activations. We complete our analysis with an excess bound on the robust objective and an extension to Wasserstein robust models with entropic regularizations.

849Bayesian Binary Search

[openreview] [pdf]

Abstract We present Bayesian Binary Search (BBS), a novel probabilistic variant of the classical binary search/bisection algorithm. BBS leverages machine learning/statistical techniques to estimate the probability density of the search space and modifies the bisection step to split based on probability density rather than the traditional midpoint, allowing for the learned distribution of the search space to guide the search algorithm. Search space density estimation can flexibly be performed using supervised probabilistic machine learning techniques (e.g., Gaussian process regression, Bayesian neural networks, quantile regression) or unsupervised learning algorithms (e.g., Gaussian mixture models, kernel density estimation (KDE), maximum likelihood estimation (MLE)). We demonstrate significant efficiency gains of using BBS on both simulated data across a variety of distributions and in a real-world binary search use case of probing channel balances in the Bitcoin Lightning Network, for which we have deployed the BBS algorithm in a production setting.
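
A minimal sketch of the core BBS step on a sorted array with a prior over the target's location: split at the median of the remaining probability mass (an equal-mass split) instead of the index midpoint. The Gaussian prior below is a hypothetical stand-in for whatever density estimator (GP regression, KDE, GMM, etc.) supplies the distribution.

```python
import numpy as np

def bayesian_binary_search(arr, target, density):
    """Binary search on a sorted `arr`, splitting at the probability midpoint
    of `density` over the remaining interval rather than the index midpoint."""
    lo, hi = 0, len(arr) - 1
    p = np.asarray(density, dtype=float)
    while lo <= hi:
        w = p[lo:hi + 1]
        cdf = np.cumsum(w / w.sum())
        mid = lo + int(np.searchsorted(cdf, 0.5))   # equal-mass split point
        if arr[mid] == target:
            return mid
        if arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

arr = np.arange(0, 1000, 7)
prior = np.exp(-0.5 * ((np.arange(len(arr)) - 40) / 10.0) ** 2)  # stand-in density
print(bayesian_binary_search(arr, arr[42], prior))
```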

850Temporal Source Recovery for Time-Series Source-Free Unsupervised Domain Adaptation

[openreview] [pdf]

Abstract Source-Free Unsupervised Domain Adaptation (SFUDA) has gained popularity for its ability to adapt pretrained models to target domains without accessing source domains, ensuring source data privacy. While SFUDA is well-developed in visual tasks, its application to Time-Series SFUDA (TS-SFUDA) remains limited due to the challenge of transferring crucial temporal dependencies across domains. Although a few researchers have begun to explore this area, they rely on specific source domain designs, which are impractical as source data owners cannot be expected to follow particular pretraining protocols. To solve this, we propose Temporal Source Recovery (TemSR), a framework that transfers temporal dependencies for effective TS-SFUDA without requiring source-specific designs. TemSR features a recovery process that leverages masking, recovery, and optimization to generate a source-like distribution with recovered source temporal dependencies. To ensure effective recovery, we further design segment-based regularization to restore local dependencies and anchor-based recovery diversity maximization to enhance the diversity of the source-like distribution. The source-like distribution is then adapted to the target domain using traditional UDA techniques. Extensive experiments across multiple TS tasks demonstrate the effectiveness of TemSR, which even surpasses existing TS-SFUDA methods that require source domain designs.

851Decomposed Learning and Grokking

[openreview] [pdf]

Abstract Grokking is a delayed transition from memorisation to generalisation in neural networks. It poses challenges for efficient learning, particularly in structured tasks and small-data regimes. This paper explores grokking in modular arithmetic, focusing explicitly on modular division with a modulus of 97. We introduce a novel learning method called Decomposed Learning, which leverages Singular Value Decomposition (SVD) to modify the weight matrices of neural networks. Decomposed Learning reduces or avoids grokking by changing the representation of a weight matrix A into the product of three matrices U, Σ and V^T, promoting the discovery of compact, generalisable representations early in the learning process. Through empirical evaluations on the modular division task, we show that Decomposed Learning significantly reduces the effect of grokking and, in some cases, eliminates it. Moreover, Decomposed Learning can reduce the parameters required for practical training, enhancing model efficiency and generalisation. These results suggest that our SVD-based method provides a practical and scalable solution for mitigating grokking, with implications for broader transformer-based learning tasks.
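
A minimal sketch of the decomposed parametrization described above, in PyTorch; the module name, rank, and initialization are illustrative assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn as nn

class SVDLinear(nn.Module):
    """Linear layer whose weight is parametrized as U @ diag(s) @ V^T, a sketch
    of the decomposed representation A = U Σ V^T described in the abstract.
    The rank and initialization here are illustrative choices."""
    def __init__(self, in_features, out_features, rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(out_features, rank) / rank**0.5)
        self.s = nn.Parameter(torch.ones(rank))          # singular values, learned
        self.Vt = nn.Parameter(torch.randn(rank, in_features) / in_features**0.5)

    def forward(self, x):
        weight = self.U @ torch.diag(self.s) @ self.Vt   # (out, in), rank-limited
        return x @ weight.T

# Drop-in replacement for nn.Linear in, e.g., a small MLP for modular division:
layer = SVDLinear(in_features=256, out_features=256, rank=32)
y = layer(torch.randn(8, 256))
```

A low rank caps the layer's capacity to memorise, which is one plausible mechanism by which such a factorization could shift learning toward compact representations earlier.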

852Continual Learning: Less Forgetting, More OOD Generalization via Adaptive Contrastive Replay

[openreview] [pdf]

Abstract Machine learning models often suffer from catastrophic forgetting of previously learned knowledge when learning new classes. Various methods have been proposed to mitigate this issue. However, rehearsal-based learning, which retains samples from previous classes, typically achieves good performance but tends to memorize specific instances, struggling with Out-of-Distribution (OOD) generalization. This often leads to high forgetting rates and poor generalization. Surprisingly, the OOD generalization capabilities of these methods have been largely unexplored. In this paper, we highlight this issue and propose a simple yet effective strategy inspired by contrastive learning and data-centric principles to address it. We introduce Adaptive Contrastive Replay (ACR), a method that employs dual optimization to simultaneously train both the encoder and the classifier. ACR adaptively populates the replay buffer with misclassified samples while ensuring a balanced representation of classes and tasks. By refining the decision boundary in this way, ACR achieves a balance between stability and plasticity. Our method significantly outperforms previous approaches in terms of OOD generalization, achieving an improvement of 13.41% on Split CIFAR-100, 9.91% on Split Mini-ImageNet, and 5.98% on Split Tiny-ImageNet.

853Offline Reinforcement Learning with Closed-loop Policy Evaluation and Diffusion World-Model Adaptation

[openreview] [pdf]

Abstract Generative models, particularly diffusion models, have been utilized as world models in offline reinforcement learning (RL) to generate synthetic data, enhancing policy learning efficiency. Current approaches either train diffusion models once before policy learning begins or rely on online interactions for alignment. In this paper, we propose a novel offline RL algorithm, Adaptive Diffusion World Model for Policy Evaluation (ADEPT), which integrates closed-loop policy evaluation with world model adaptation. It employs an uncertainty-penalized diffusion model to iteratively interact with the target policy for evaluation. The uncertainty of the world model is estimated by comparing outputs generated under different noises, which is then used to constrain out-of-distribution actions. During policy training, the diffusion model performs importance-sampled updates to progressively align with the evolving policy. We analyze the performance of the proposed method and provide an upper bound on the return gap between our method and the real environment under an optimal policy. The results shed light on various key factors affecting learning performance. Evaluations on the D4RL benchmark demonstrate significant improvement over state-of-the-art baselines, especially when only sub-optimal demonstrations are available -- thus requiring improved alignment between the world model and offline policy evaluation.

854Mitigating Unobserved Confounding via Diffusion Probabilistic Models

[openreview] [pdf]

Abstract Conditional average treatment effect estimation from observational data is a challenging task due to the existence of unobserved confounders. Previous methods mostly rely on the ignorability assumption, ignoring the unobserved confounders, or overlook the impact of a priori knowledge on the generation process of the latent variable, which can be quite impractical in real-world scenarios. Motivated by recent advances in latent variable modeling, we propose to capture the unobserved latent space using a diffusion model and to estimate the causal effect accordingly. More concretely, we build on the reverse diffusion process for the unobserved confounders as a Markov chain conditioned on a priori knowledge. In order to implement our model in a feasible way, we derive the variational bound in closed form. In the experiments, we compare our model with state-of-the-art methods based on both synthetic and real-world datasets, demonstrating consistent improvements of our model.

855Learning to Achieve Goals with Belief State Transformers

[openreview] [pdf]

Abstract We introduce the “Belief State Transformer”, a next-token predictor that takes both a prefix and suffix as inputs, with a novel objective of predicting both the next token for the prefix and the previous token for the suffix. The Belief State Transformer effectively learns to solve challenging problems that conventional forward-only transformers struggle with, in a domain-independent fashion. Key to this success is learning a compact belief state that captures all relevant information necessary for accurate predictions. Empirical ablations show that each component of the model is essential in difficult scenarios where standard Transformers fall short. For the task of story writing with known prefixes and suffixes, our approach outperforms the Fill-in-the-Middle method for reaching known goals and demonstrates improved performance even when the goals are unknown. Altogether, the Belief State Transformer enables more efficient goal-conditioned decoding, better test-time inference, and high-quality text representations on small-scale problems.
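
To make the two-sided objective concrete, here is a toy sketch in PyTorch: encode the prefix forward and the suffix backward, form a combined belief state, and train two heads to predict the prefix's next token and the suffix's previous token. GRU encoders stand in for the paper's transformers, and all names and sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, D = 100, 64   # toy vocabulary size and hidden width

class BeliefStateToy(nn.Module):
    """Toy version of the two-sided objective; GRUs replace the transformer
    encoders used in the paper."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, D)
        self.fwd = nn.GRU(D, D, batch_first=True)
        self.bwd = nn.GRU(D, D, batch_first=True)
        self.next_head = nn.Linear(2 * D, V)   # next token after the prefix
        self.prev_head = nn.Linear(2 * D, V)   # previous token before the suffix

    def forward(self, prefix, suffix):
        _, h_f = self.fwd(self.emb(prefix))              # forward state of the prefix
        _, h_b = self.bwd(self.emb(suffix.flip(1)))      # backward state of the suffix
        belief = torch.cat([h_f[-1], h_b[-1]], dim=-1)   # compact belief state
        return self.next_head(belief), self.prev_head(belief)

model = BeliefStateToy()
seq = torch.randint(0, V, (4, 12))                       # toy token sequences
prefix, suffix = seq[:, :5], seq[:, 7:]                  # leave a 2-token gap
next_logits, prev_logits = model(prefix, suffix)
loss = F.cross_entropy(next_logits, seq[:, 5]) + F.cross_entropy(prev_logits, seq[:, 6])
```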

856ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning

[openreview] [pdf]

Abstract Recently, advancements in video synthesis have attracted significant attention. Video synthesis models such as AnimateDiff and Stable Video Diffusion have demonstrated the practical applicability of diffusion models in creating dynamic visual content. The emergence of SORA has further spotlighted the potential of video generation technologies. Despite these advancements, the extension of video lengths remains constrained by computational resources. Most existing video synthesis models are limited to generating short video clips. In this paper, we propose a novel post-tuning methodology for video synthesis models, called ExVideo. This approach is designed to enhance the capability of current video synthesis models, allowing them to produce content over extended temporal durations while incurring lower training expenditures. In particular, we design extension strategies for common temporal model architectures, including 3D convolution, temporal attention, and positional embedding. To evaluate the efficacy of our proposed post-tuning approach, we trained ExSVD, an extended model based on the Stable Video Diffusion model. Our approach enhances the model’s capacity to generate up to 5× its original number of frames, requiring only 1.5k GPU hours of training on a dataset comprising 40k videos. Importantly, the substantial increase in video length doesn’t compromise the model’s innate generalization capabilities, and the model showcases its advantages in generating videos of diverse styles and resolutions. We will release the source code and the enhanced model publicly.

857On Minimizing Adversarial Counterfactual Error in Adversarial Reinforcement Learning

[openreview] [pdf]

Abstract Deep Reinforcement Learning (DRL) policies are critically vulnerable to adversarial noise in observations, posing severe risks in safety-critical scenarios. For example, a self-driving car receiving manipulated sensory inputs about traffic signs could lead to catastrophic outcomes. Existing strategies to fortify RL algorithms against such adversarial perturbations generally fall into two categories: (a) using regularization methods that enhance robustness by incorporating adversarial loss terms into the value objectives, and (b) adopting “maximin” principles, which focus on maximizing the minimum value to ensure robustness. While regularization methods reduce the likelihood of successful attacks, their effectiveness drops significantly if an attack does succeed. On the other hand, maximin objectives, although robust, tend to be overly conservative. To address this challenge, we introduce a novel objective called Adversarial Counterfactual Error (ACoE), which naturally balances optimizing value and robustness against adversarial attacks. To optimize ACoE in a scalable manner in model-free settings, we propose a theoretically justified surrogate objective known as Cumulative-ACoE (C-ACoE). The core idea of optimizing C-ACoE is utilizing the belief about the underlying true state given the adversarially perturbed observation. Our empirical evaluations demonstrate that our method outperforms current state-of-the-art approaches for addressing adversarial RL problems across all established benchmarks (MuJoCo, Atari, and Highway) used in the literature.

858General Framework for Off-Policy Learning with Partially-Observed Reward

[openreview] [pdf]

Abstract Off-policy learning (OPL) in contextual bandits aims to learn a decision-making policy that maximizes the target rewards by using only historical interaction data collected under previously developed policies. Unfortunately, when rewards are only partially observed, the effectiveness of OPL degrades severely. Well-known examples of such partial rewards include explicit ratings in content recommendations, conversion signals on e-commerce platforms that are partial due to delay, and the issue of censoring in medical problems. One possible solution to deal with such partial rewards is to use secondary rewards, such as dwelling time, clicks, and medical indicators, which are more densely observed. However, relying solely on such secondary rewards can also lead to poor policy learning since they may not align with the target reward. Thus, this work studies a new and general problem of OPL where the goal is to learn a policy that maximizes the expected target reward by leveraging densely observed secondary rewards as supplemental data. We then propose a new method called Hybrid Policy Optimization for Partially-Observed Reward (HyPeR), which effectively uses the secondary rewards in addition to the partially observed target reward to achieve effective OPL despite the challenging scenario. We also discuss a case where we aim to optimize not only the expected target reward but also the expected secondary rewards to some extent; counter-intuitively, we show that leveraging the two objectives is in fact advantageous even for optimizing the target reward alone. Along with a statistical analysis of our proposed methods, empirical evaluations on both synthetic and real-world data show that HyPeR outperforms existing methods in various scenarios.

859Ambient Diffusion Posterior Sampling: Solving Inverse Problems with Diffusion Models Trained on Corrupted Data

[openreview] [pdf]

Abstract We provide a framework for solving inverse problems with diffusion models learned from linearly corrupted data. First, we extend the Ambient Diffusion framework to enable training directly from measurements corrupted in the Fourier domain, and we train diffusion models for MRI with access only to Fourier-subsampled multi-coil measurements at acceleration factors R = 2, 4, 6, 8. Second, we propose Ambient Diffusion Posterior Sampling (A-DPS), a reconstruction algorithm that leverages generative models pre-trained on one type of corruption (e.g. image inpainting) to perform posterior sampling on measurements from a different forward process (e.g. image blurring). For MRI reconstruction in high acceleration regimes, we observe that A-DPS models trained on subsampled data are better suited to solving inverse problems than models trained on fully sampled data. We also test the efficacy of A-DPS on natural image datasets (CelebA, FFHQ, and AFHQ) and show that A-DPS can sometimes outperform models trained on clean data for several image restoration tasks in both speed and performance.

860Variational Mode Decomposition and Linear Embeddings are What You Need For Time-Series Forecasting

[openreview] [pdf]

Abstract Time-series forecasting often faces challenges due to data volatility, which can lead to inaccurate predictions. Variational Mode Decomposition (VMD) has emerged as a promising technique to mitigate volatility by decomposing data into distinct modes, enhancing forecast accuracy. This study integrates VMD with linear models to develop a robust forecasting framework. Our approach is evaluated on 13 diverse datasets, including ETTm2, WindTurbine, M4, and 10 air quality datasets from Southeast Asian cities. The effectiveness of the VMD strategy is assessed by comparing Root Mean Squared Error (RMSE) values from models utilizing VMD against those without it. Additionally, we benchmark linear-based models against well-known neural network architectures such as LSTM, BLSTM, and RNN. The results demonstrate a significant reduction in RMSE across nearly all models following VMD application. Notably, the Linear + VMD model achieved the lowest average RMSE in univariate forecasting at 0.619. In multivariate forecasting, the DLinear + VMD model consistently outperformed others, attaining the lowest RMSE across all datasets with an average of 0.019. These findings underscore the effectiveness of combining VMD with linear models for superior time-series forecasting.
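
A minimal sketch of the VMD-plus-linear pipeline: decompose the series into K modes, fit one linear model per mode on lagged windows, and sum the one-step forecasts. This assumes the vmdpy package and its VMD(f, alpha, tau, K, DC, init, tol) signature; the hyperparameters are illustrative, not the paper's:

```python
import numpy as np
from vmdpy import VMD                      # pip install vmdpy (assumed available)
from sklearn.linear_model import LinearRegression

def vmd_linear_forecast(series, K=5, lookback=24):
    """Decompose, forecast each mode linearly, and sum the per-mode forecasts."""
    # alpha: bandwidth constraint, tau: noise tolerance, DC: keep DC mode, tol: convergence
    modes, _, _ = VMD(series, alpha=2000, tau=0.0, K=K, DC=0, init=1, tol=1e-7)
    forecast = 0.0
    for mode in modes:                     # modes: array of shape (K, len(series))
        X = np.lib.stride_tricks.sliding_window_view(mode[:-1], lookback)
        y = mode[lookback:]                # window i predicts mode[i + lookback]
        model = LinearRegression().fit(X, y)
        forecast += model.predict(mode[-lookback:][None, :])[0]
    return forecast

t = np.arange(512)
series = np.sin(0.1 * t) + 0.5 * np.sin(0.9 * t) + 0.1 * np.random.randn(512)
print(vmd_linear_forecast(series))
```

The design intuition matches the abstract: each mode is far smoother than the raw series, so a simple linear model per mode can suffice where a single model on the volatile original would struggle.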

861How to Get Your LLM to Generate Challenging Problems for Evaluation

[openreview] [pdf]

Abstract The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems, particularly for tasks such as long-context reasoning. Moreover, the rapid saturation of existing human-curated benchmarks by LLMs further necessitates the development of scalable and automatically renewable evaluation methodologies. In this work, we introduce CHASE, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, since we want to generate synthetic data for evaluation, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: document-based question answering, repository-level code completion, and math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy, thereby demonstrating the effectiveness of our framework at generating hard problems. Our experiments further reveal that the Gemini models significantly outperform other LLMs at long-context reasoning, and that the performance of all LLMs drastically drops by as much as 70% when we scale up the context size to 50k tokens.

862A Computation and Communication Efficient Projection-free Algorithm for Decentralized Constrained Optimization

[openreview] [pdf]

Abstract Decentralized constrained optimization problems arise in numerous real-world applications, where a major challenge lies in the computational complexity of projecting onto complex sets, especially in large-scale systems. The projection-free Frank-Wolfe (FW) method is popular for constrained optimization over complex sets due to its efficiency in avoiding the projection step. However, when applying FW methods to decentralized constrained finite-sum optimization problems, previous studies provide suboptimal incremental first-order oracle (IFO) bounds in both convex and non-convex settings. In this paper, we propose a stochastic algorithm named Decentralized Variance Reduction Gradient Tracking Frank-Wolfe (DVRGTFW), which incorporates variance reduction, gradient tracking, and multi-consensus in the FW update to obtain tight bounds. We present a novel convergence analysis, diverging from previous decentralized FW methods, and demonstrate $\tilde{\mathcal{O}}(n+\sqrt{n/m}\,L\varepsilon^{-1})$ and $\mathcal{O}(\sqrt{n/m}\,L^2\varepsilon^{-2})$ IFO complexity bounds in convex and non-convex settings, respectively. To the best of our knowledge, these bounds are the best achieved in the literature to date. Besides, in the non-convex case, DVRGTFW achieves $\mathcal{O}(L^2\varepsilon^{-2}/\sqrt{1-\lambda_2(W)})$ communication complexity, which is close to the lower bound $\Omega(L\varepsilon^{-2}/\sqrt{1-\lambda_2(W)})$. Empirical results validate the convergence properties of DVRGTFW and highlight its superior performance over other related methods.
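
For context, a minimal sketch of the basic projection-free Frank-Wolfe update that such methods build on, shown with an L1-ball linear minimization oracle; this is vanilla FW, not the paper's variance-reduced decentralized algorithm:

```python
import numpy as np

def lmo_l1(grad, radius=1.0):
    """Linear minimization oracle over the L1 ball: argmin_{||s||_1 <= r} <grad, s>.
    Closed form: put all mass on the largest-magnitude coordinate."""
    s = np.zeros_like(grad)
    i = np.argmax(np.abs(grad))
    s[i] = -radius * np.sign(grad[i])
    return s

def frank_wolfe(grad_fn, x0, steps=200, radius=1.0):
    """Vanilla Frank-Wolfe: no projection, just an LMO call and a convex combination."""
    x = x0.copy()
    for t in range(steps):
        s = lmo_l1(grad_fn(x), radius)
        gamma = 2.0 / (t + 2)          # standard FW step-size schedule
        x = (1 - gamma) * x + gamma * s
    return x

# Toy usage: sparse least squares, min ||Ax - b||^2 over the L1 ball.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
x = frank_wolfe(lambda x: 2 * A.T @ (A @ x - b), np.zeros(20))
```

The appeal in decentralized settings is visible here: the only set-dependent operation is the cheap LMO, which the paper then combines with variance-reduced local gradients and consensus steps.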

863On last-iterate convergence of distributed Stochastic Gradient Descent algorithm with momentum

[openreview] [pdf]

Abstract Distributed stochastic gradient optimization algorithms are studied extensively to address challenges in centralized approaches, such as data privacy, communication load, and computational efficiency, especially when dealing with large datasets. However, convergence theory for these algorithms has been limited, particularly for distributed momentum-based SGD (mSGD) algorithms. Current theoretical work on distributed mSGD algorithms primarily focuses on establishing time-average convergence theory, whereas last-iterate convergence—considered a stronger and more practical notion than time-average convergence—has yet to be thoroughly explored. In this paper, we aim to establish the last-iterate convergence theory for a class of distributed mSGD algorithms with a decaying learning rate. First, we propose a general framework for distributed mSGD algorithms. Within this framework and under general conditions, we prove the last-iterate convergence of the gradient of the loss function for a class of distributed mSGD algorithms. Furthermore, we estimate the corresponding last-iterate convergence rate under supplementary conditions. Moreover, we theoretically prove that, in the early stage, adding a momentum term makes the iterations converge more rapidly to a neighborhood of the stationary point. Experiments are provided to illustrate the theoretical findings.
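
A minimal sketch of the kind of scheme studied here: local momentum steps with a decaying learning rate followed by an averaging step. The exact mixing rule and step-size schedule are illustrative assumptions, with plain averaging standing in for a general consensus step:

```python
import numpy as np

def distributed_msgd(grad_fns, x0, T=500, beta=0.9, lr0=0.1):
    """Each agent takes a local momentum-SGD step on its own stochastic gradient,
    then iterates are averaged (an all-reduce stand-in for the mixing step)."""
    n = len(grad_fns)
    x = [x0.copy() for _ in range(n)]
    m = [np.zeros_like(x0) for _ in range(n)]
    for t in range(1, T + 1):
        lr = lr0 / np.sqrt(t)                      # decaying learning rate
        for i in range(n):
            m[i] = beta * m[i] + grad_fns[i](x[i]) # momentum accumulation
            x[i] = x[i] - lr * m[i]
        avg = sum(x) / n                           # consensus / averaging step
        x = [avg.copy() for _ in range(n)]
    return avg

# Toy usage: three agents with noisy gradients of the same quadratic.
rng = np.random.default_rng(0)
grads = [lambda x, r=rng: 2 * x + 0.1 * r.standard_normal(x.shape) for _ in range(3)]
x_final = distributed_msgd(grads, np.ones(4))
```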

864From Conflicts to Convergence: A Zeroth-order Method for Multi-Objective Learning

[openreview] [pdf]

Abstract Multi-objective learning (MOL) is a popular paradigm for learning problems under multiple criteria, where various dynamic weighting algorithms (e.g., MGDA and MODO) have been formulated to find an update direction that avoids conflicts among objectives. Recently, increasing effort has been devoted to tackling black-box MOL, where the gradient information of the objectives is unavailable or difficult to obtain. Despite the impressive success of zeroth-order methods for single-objective black-box learning, the corresponding MOL algorithms and theoretical understanding are largely absent. Unlike in single-objective problems, the errors introduced in MOL by zeroth-order gradients can simultaneously affect both the gradient estimation and the gradient coefficients λ, leading to further error amplification. To address this issue, we propose a Stochastic Zeroth-order Multiple Objective Descent algorithm (SZMOD), which leverages function evaluations to approximate gradients and develops a new decomposition strategy to handle complicated black-box multi-objective optimization. Theoretically, we provide convergence and generalization guarantees for SZMOD in both general non-convex and strongly convex settings. Our results demonstrate that the proposed SZMOD enjoys a promising generalization bound of $\mathcal{O}(n^{-1/2})$, which is comparable to existing results for first-order methods requiring additional gradient information. Experimental results validate our theoretical analysis.
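
A minimal sketch of the two ingredients the abstract describes: a two-point zeroth-order gradient estimator built from function evaluations, and a dynamic-weighting step (here the two-objective MGDA closed form stands in for SZMOD's decomposition strategy):

```python
import numpy as np

def zo_grad(f, x, mu=1e-3, n_dirs=32, rng=None):
    """Two-point zeroth-order estimator:
    g ≈ mean_u [(f(x + mu*u) - f(x - mu*u)) / (2*mu) * u], with u ~ N(0, I)."""
    rng = rng or np.random.default_rng()
    g = np.zeros_like(x)
    for _ in range(n_dirs):
        u = rng.standard_normal(x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_dirs

def min_norm_weight(g1, g2):
    """Closed-form min-norm weighting for two objectives (MGDA):
    lambda = clip(<g2 - g1, g2> / ||g1 - g2||^2, 0, 1)."""
    diff = g1 - g2
    return np.clip(np.dot(g2 - g1, g2) / (np.dot(diff, diff) + 1e-12), 0.0, 1.0)

# Toy bi-objective descent using only function evaluations.
f1 = lambda x: np.sum((x - 1.0) ** 2)
f2 = lambda x: np.sum((x + 1.0) ** 2)
x = np.zeros(5)
for _ in range(100):
    g1, g2 = zo_grad(f1, x), zo_grad(f2, x)
    lam = min_norm_weight(g1, g2)
    x -= 0.05 * (lam * g1 + (1 - lam) * g2)
```

Note how the noise in g1 and g2 enters the weight lam as well as the step itself, which is exactly the error-amplification channel the abstract highlights.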

865ParetoFlow: Guided Flows in Multi-Objective Optimization

[openreview] [pdf]

Abstract In offline multi-objective optimization (MOO), we leverage an offline dataset of designs and their associated labels to simultaneously minimize multiple objectives. This setting more closely mirrors complex real-world problems compared to single-objective optimization. Recent works mainly employ evolutionary algorithms and Bayesian optimization, with limited attention given to the generative modeling capabilities inherent in such data. In this study, we explore generative modeling in offline MOO through flow matching, noted for its effectiveness and efficiency. We introduce a ParetoFlow method, specifically designed to guide flow sampling to approximate the Pareto front. Traditional predictor (classifier) guidance is inadequate for this purpose because it models only a single objective. In response, we propose a multi-objective predictor guidance module that assigns each sample a weight vector, representing a weighted distribution across multiple objective predictions. A local filtering scheme is introduced to address non-convex Pareto fronts. These weights uniformly cover the entire objective space, effectively directing sample generation towards the Pareto front. Since distributions with similar weights tend to generate similar samples, we introduce a neighboring evolution module to foster knowledge sharing among neighboring distributions. This module generates offspring from these distributions, and selects the most promising one for the next iteration. Our method achieves state-of-the-art performance across various tasks. Our code is available.

866Do LLMs estimate uncertainty well in instruction-following?

[openreview] [pdf]

Abstract Large language models (LLMs) could be valuable personal AI agents across various domains, provided they can precisely follow user instructions. However, recent studies have shown significant limitations in LLMs’ instruction-following capabilities, raising concerns about their reliability in high-stakes applications. Accurately estimating LLMs’ uncertainty in adhering to instructions is critical to mitigating deployment risks. We present, to our knowledge, the first systematic evaluation of the uncertainty estimation abilities of LLMs in the context of instruction-following. Our study identifies key challenges with existing instruction-following benchmarks, where multiple factors are entangled with the uncertainty that stems from instruction-following, complicating the isolation and comparison across methods and models. To address these issues, we introduce a controlled evaluation setup with two benchmark versions of the data, enabling a comprehensive comparison of uncertainty estimation methods under various conditions. Our findings show that existing uncertainty methods struggle, particularly when models make subtle errors in instruction following. While internal model states provide some improvement, they remain inadequate in more complex scenarios. The insights from our controlled evaluation setups provide a crucial understanding of LLMs’ limitations and potential for uncertainty estimation in instruction-following tasks, paving the way for more trustworthy AI agents.

867ConceptPrune: Concept Editing in Diffusion Models via Skilled Neuron Pruning

[openreview] [pdf]

Abstract While large-scale text-to-image diffusion models have demonstrated impressive image-generation capabilities, there are significant concerns about their potential misuse for generating unsafe content, violating copyright, and perpetuating societal biases. Recently, the text-to-image generation community has begun addressing these concerns by editing or unlearning undesired concepts from pre-trained models. However, these methods often involve data-intensive and inefficient fine-tuning or utilize various forms of token remapping, rendering them susceptible to adversarial jailbreaks. In this paper, we present a simple and effective training-free approach, ConceptPrune, wherein we first identify critical regions within pre-trained models responsible for generating undesirable concepts, thereby facilitating straightforward concept unlearning via weight pruning. Experiments across a range of concepts including artistic styles, nudity, and object erasure demonstrate that target concepts can be efficiently erased by pruning a tiny fraction, approximately 0.12% of total weights, enabling multi-concept erasure and robustness against various white-box and black-box adversarial attacks.

868Beyond Markov Assumption: Improving Sample Efficiency in MDPs by Historical Augmentation

[openreview] [pdf]

Abstract Under the Markov assumption of Markov Decision Processes (MDPs), an optimal stationary policy does not need to consider history and is no worse than any non-stationary or history-dependent policy. Therefore, existing Deep Reinforcement Learning (DRL) algorithms usually model sequential decision-making as an MDP and then try to optimize a stationary policy by single-step state transitions. However, such optimization often suffers from sample inefficiency when the causal relationships of state transitions are complex. To address this problem, this paper investigates whether augmenting the states with their historical information can simplify the complex causal relationships in MDPs and thus improve the sample efficiency of DRL. First, we demonstrate that a complex causal relationship of single-step state transitions may be inferred by a simple causal function of the historically augmented states. Then, we propose a convolutional neural network architecture to learn the representation of the current state and its historical trajectory. The main idea of this representation learning is to compress the high-dimensional historical trajectories into a low-dimensional space. In this way, we can extract the simple causal relationships from historical information and avoid the overfitting caused by high-dimensional data. Finally, we formulate the Historical Augmentation Aided Actor-Critic (HA3C) algorithm by adding the learned representations to the actor-critic method. Experiments on standard MDP tasks demonstrate that HA3C outperforms current state-of-the-art methods in terms of both sample efficiency and performance.

869Analytic DAG Constraints for Differentiable DAG Learning

[openreview] [pdf]

Abstract Recovering underlying Directed Acyclic Graph (DAG) structures from observational data presents a formidable challenge due to the combinatorial nature of the DAG-constrained optimization problem. Recently, researchers have identified gradient vanishing as one of the primary obstacles in differentiable DAG learning and have proposed several DAG constraints to mitigate this issue. By developing the necessary theory to establish a connection between analytic functions and DAG constraints, we demonstrate that analytic functions from the set $\{f(x) = c_0 + \sum_{i=1}^{\infty} c_i x^i \mid c_0 \geqslant 0;\ \forall i > 0, c_i > 0;\ r = \lim_{i\rightarrow\infty} c_i/c_{i+1} > 0\}$ can be employed to formulate effective DAG constraints. Furthermore, we establish that this set of functions is closed under several functional operators, including differentiation, summation, and multiplication. Consequently, these operators can be leveraged to create novel DAG constraints based on existing ones. Using these properties, we design a series of DAG constraints and an efficient algorithm to evaluate them. Experiments in various settings show that our DAG constraints outperform previous state-of-the-art approaches.
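
A minimal sketch of such analytic DAG constraints: the exponential member (the well-known NOTEARS constraint, with c_i = 1/i!) and a generic truncated-series member. The coefficients and truncation below are illustrative, not the paper's specific designs:

```python
import numpy as np
from scipy.linalg import expm

def dag_constraint_exp(W):
    """h(W) = tr(exp(W ∘ W)) - d: zero iff W contains no directed cycle."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d

def dag_constraint_poly(W, coeffs):
    """Generic truncated member of the family: h(W) = tr(sum_i c_i (W∘W)^i) - d*c_0,
    which reduces to sum_{i>=1} c_i tr((W∘W)^i) and vanishes iff W is acyclic."""
    d = W.shape[0]
    M, P, h = W * W, np.eye(d), 0.0
    for c in coeffs:                      # coeffs = [c_0, c_1, c_2, ...]
        h += c * np.trace(P)
        P = P @ M
    return h - d * coeffs[0]

W_dag = np.triu(np.random.rand(4, 4), k=1)   # strictly upper-triangular => acyclic
W_cyc = W_dag + W_dag.T                      # symmetric => contains cycles
print(dag_constraint_exp(W_dag), dag_constraint_exp(W_cyc))   # ~0 vs. > 0
```

The intuition: tr((W∘W)^i) sums the weights of length-i directed cycles, so any positive-coefficient analytic series of these traces penalizes exactly the cyclic structure, and the choice of coefficients governs how quickly gradients vanish.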

870DiffImp: Efficient Diffusion Model for Probabilistic Time Series Imputation with Bidirectional Mamba Backbone

[openreview] [pdf]

Abstract Probabilistic time series imputation has been widely applied in real-world scenarios due to its ability to estimate the uncertainty of imputation results. Meanwhile, denoising diffusion probabilistic models (DDPMs) have achieved great success in probabilistic time series imputation tasks with their power to model complex distributions. However, current DDPM-based probabilistic time series imputation methodologies are confronted with two types of challenges: 1) the backbone modules of the denoising parts are not capable of achieving sequence modeling with low time complexity; 2) the architecture of the denoising modules cannot effectively handle the inter-variable and bidirectional dependencies in the time series imputation problem. To address the first challenge, we integrate the computationally efficient state space model, namely Mamba, as the backbone denoising module for DDPMs. To tackle the second challenge, we carefully devise several SSM-based blocks for bidirectional modeling and inter-variable relation understanding. Experimental results demonstrate that our approach can achieve state-of-the-art time series imputation results on multiple datasets, different missing scenarios, and missing ratios.

871Autoencoders for Anomaly Detection are Unreliable

[openreview] [pdf]

Abstract Autoencoders are frequently used for anomaly detection, in both unsupervised and semi-supervised settings. They rely on the assumption that, when trained using the reconstruction loss, they will reconstruct normal data more accurately than anomalous data. Some recent works have posited that this assumption may not always hold, but little has been done to study its validity in theory. In this work we prove that the assumption indeed does not hold, and show that anomalies lying far away from normal data can be perfectly reconstructed in practice. We extend the understanding of autoencoders for anomaly detection by showing how they can reconstruct perfectly outside the bounds of the training data, or interpolate undesirably, and note how this can be dangerous in safety-critical applications. We connect theory to practice by showing that the proven behavior of linear autoencoders also occurs when applying non-linear autoencoders to both tabular data and real-world image data, the two primary application areas of autoencoders for anomaly detection.
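
A minimal sketch of the failure mode in the linear case, using PCA as the optimal linear autoencoder: an anomaly far outside the training range but inside the learned subspace is reconstructed almost perfectly, so its reconstruction error cannot flag it. The toy data and bottleneck size are illustrative:

```python
import numpy as np

# "Normal" data: variance concentrated along the first axis.
rng = np.random.default_rng(0)
normal = rng.standard_normal((1000, 2)) * np.array([3.0, 0.1])

# Optimal linear autoencoder with a 1-D bottleneck = top principal direction.
mean = normal.mean(axis=0)
_, _, Vt = np.linalg.svd(normal - mean, full_matrices=False)
V = Vt[:1].T                                 # (2, 1) encoder/decoder basis

def reconstruct(x):
    return mean + (x - mean) @ V @ V.T       # encode then decode

# An anomaly 300+ standard deviations out, yet inside the learned subspace:
anomaly = np.array([1000.0, 0.0])
print(np.linalg.norm(anomaly - reconstruct(anomaly)))       # ≈ 0: undetectable
print(np.linalg.norm(normal[0] - reconstruct(normal[0])))   # small, as expected
```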

872Finally Rank-Breaking Conquers MNL Bandits: Optimal and Efficient Algorithms for MNL Assortment

[openreview] [pdf]

Abstract We address the problem of active online assortment optimization with preference feedback, which is a framework for modeling user choices and subsetwise utility maximization. The framework is useful in various real-world applications including ad placement, online retail, recommender systems, and fine-tuning language models, amongst many others. The problem, although studied in the past, lacks an intuitive and practical solution approach with a simultaneously efficient algorithm and an optimal regret guarantee. For example, popular assortment selection algorithms often require the presence of a “strong reference” that is always included in the choice sets; further, they are designed to offer the same assortments repeatedly until the reference item gets selected---all such requirements are quite unrealistic for practical applications. In this paper, we design efficient algorithms for the problem of regret minimization in assortment selection with Plackett-Luce (PL) based user choices. We derive a novel concentration guarantee for estimating the score parameters of the PL model using Pairwise Rank-Breaking, which builds the foundation of our proposed algorithms. Moreover, our methods are practical, provably optimal, and devoid of the aforementioned limitations of existing methods.
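
A minimal sketch of the pairwise rank-breaking principle: break each observed choice into pairwise wins, then run the classic minorize-maximize (Zermelo) iteration on the win counts. This illustrates the estimation idea only, not the paper's bandit algorithm or its concentration analysis:

```python
import numpy as np

def rank_break(choices):
    """Break each observed choice (winner, offered_set) into pairwise wins:
    the chosen item 'beats' every other item in the offered assortment."""
    n = max(max(s) for _, s in choices) + 1
    wins = np.zeros((n, n))
    for winner, offered in choices:
        for loser in offered:
            if loser != winner:
                wins[winner, loser] += 1
    return wins

def pl_scores(wins, iters=200):
    """Zermelo / MM estimate of PL scores from pairwise win counts:
    w_i <- W_i / sum_{j != i} n_ij / (w_i + w_j)."""
    n = wins.shape[0]
    w = np.ones(n)
    total = wins + wins.T                      # n_ij: number of i-vs-j comparisons
    for _ in range(iters):
        for i in range(n):
            denom = sum(total[i, j] / (w[i] + w[j]) for j in range(n) if j != i)
            w[i] = wins[i].sum() / max(denom, 1e-12)
        w /= w.sum()                           # fix the scale ambiguity
    return w

choices = [(0, {0, 1, 2}), (0, {0, 1}), (2, {1, 2}), (0, {0, 2}), (1, {1, 2})]
print(pl_scores(rank_break(choices)))
```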

873CRAFT: Time Series Forecasting with Cross-Future Behavior Awareness

[openreview] [pdf]

Abstract Time series forecasting is crucial infrastructure in the field of e-commerce, providing technical support for consumer behavior analysis, sales trend forecasting, etc. E-commerce allows consumers to reserve in advance, and these pre-booking features reflect future sales trends and can increase the certainty of time series forecasting. In this paper, we define these features as Cross-Future Behavior, which occurs before the current time but takes effect in the future. To improve the performance of time series forecasting, we leverage these features and propose the CRoss-Future Behavior Awareness based Time Series Forecasting method (CRAFT). The core idea of CRAFT is to utilize the trend of cross-future behavior to mine the trend of the time series data to be predicted. Specifically, to address the sparsity and partiality of cross-future behavior, CRAFT employs a Koopman Predictor Module to extract the key trend and an Internal Trend Mining Module to supplement the unknown area of the cross-future behavior matrix. Then, we introduce an External Trend Guide Module with a hierarchical structure to acquire more representative trends from higher levels. Finally, we apply a demand-constrained loss to calibrate the distribution deviation of the prediction results. We conduct experiments on real-world data: experiments on both an offline large-scale dataset and an online A/B test demonstrate the effectiveness of CRAFT. Our dataset and code will be released after formal publication.

874Leveraging Semantic and Positional Uncertainty for Trajectory Prediction

[openreview] [pdf]

Abstract Given a time horizon with historical movement data and environmental context, trajectory prediction aims to forecast the future motion of dynamic entities, such as vehicles and pedestrians. A key challenge in this task arises from the dynamic and noisy nature of real-time maps. This noise primarily stems from two sources: (1) positional errors due to sensor inaccuracies or environmental occlusions, and (2) cognitive errors resulting from incorrect scene understanding. In an attempt to solve this problem, we propose a new framework that estimates two kinds of uncertainty simultaneously, i.e., positional uncertainty and semantic uncertainty, and explicitly incorporates both uncertainties into the trajectory prediction process. In particular, we introduce a dual-head structure to independently perform semantic prediction twice and positional prediction twice, and further extract the prediction variance as the uncertainty indicator in an end-to-end manner. The uncertainty is then directly concatenated with the semantic and positional predictions to enhance the trajectory estimation. To validate the effectiveness of our uncertainty-aware approach, we evaluate it on the real-world driving dataset nuScenes. Extensive experiments on 3 map estimation and 2 trajectory prediction approaches show that the proposed method (1) effectively captures map noise through both positional and semantic uncertainties, and (2) seamlessly integrates with and enhances existing trajectory prediction methods on multiple evaluation metrics, i.e., minADE, minFDE, and MR.

875Actionable Inverse Classification with Action Fairness Guarantees

[openreview] [pdf]

Abstract Machine learning (ML) classifiers are increasingly used in critical decision-making domains such as finance, healthcare, and the judiciary. However, their interpretability and fairness remain significant challenges, often leaving users without clear guidance on how to improve unfavourable outcomes. This paper introduces an actionable ML framework that provides minimal, explainable modifications to input data to change classification results. We also propose a novel concept of “action fairness,” which ensures that users from different subgroups incur similar costs when altering their classification outcomes. Our approach identifies the nearest decision boundary point to a given query, allowing for the determination of minimal cost actions. We demonstrate the effectiveness of this method using real-world credit assessment data, showing that our solution not only improves the fairness of classifier outcomes but also enhances their usability and interpretability.

876Better Instruction-Following Through Minimum Bayes Risk

[openreview] [pdf]

Abstract General-purpose LLM judges capable of human-level evaluation provide not only a scalable and accurate way of evaluating instruction-following LLMs but also new avenues for supervising and improving their performance. One promising way of leveraging LLM judges for supervision is through Minimum Bayes Risk (MBR) decoding, which uses a reference-based evaluator to select a high-quality output from amongst a set of candidate outputs. In the first part of this work, we explore using MBR decoding as a method for improving the test-time performance of instruction-following LLMs. We find that MBR decoding with reference-based LLM judges substantially improves over greedy decoding, best-of-N decoding with reference-free judges and MBR decoding with lexical and embedding-based metrics on AlpacaEval and MT-Bench. These gains are consistent across LLMs with up to 70B parameters, demonstrating that smaller LLM judges can be used to supervise much larger LLMs. Then, seeking to retain the improvements from MBR decoding while mitigating additional test-time costs, we explore iterative self-training on MBR-decoded outputs. We find that self-training using Direct Preference Optimisation leads to significant performance gains, such that the self-trained models with greedy decoding generally match and sometimes exceed the performance of their base models with MBR decoding.
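
A minimal sketch of MBR decoding over a candidate set: each candidate is scored by its average utility against the others, which double as pseudo-references. The `utility` argument is any reference-based scorer (the paper uses LLM judges); the token-overlap F1 below is only a stand-in:

```python
import numpy as np

def mbr_decode(candidates, utility):
    """Minimum Bayes Risk decoding: return the candidate with the highest
    expected utility against the other sampled candidates."""
    n = len(candidates)
    scores = np.zeros(n)
    for i, hyp in enumerate(candidates):
        scores[i] = np.mean([utility(hyp, ref)
                             for j, ref in enumerate(candidates) if j != i])
    return candidates[int(np.argmax(scores))]

# Toy usage with token-overlap F1 as a cheap stand-in for an LLM judge.
def f1(hyp, ref):
    h, r = set(hyp.split()), set(ref.split())
    if not h or not r:
        return 0.0
    p, rec = len(h & r) / len(h), len(h & r) / len(r)
    return 2 * p * rec / (p + rec) if p + rec else 0.0

cands = ["the cat sat on the mat", "a cat sat on a mat", "dogs run fast"]
print(mbr_decode(cands, f1))
```

Note the quadratic number of utility calls in the candidate count, which is the test-time cost the paper's self-training stage then tries to amortize away.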

877On-Policy Fine-grained Knowledge Feedback for Hallucination Mitigation

[openreview] [pdf]

Abstract Hallucination occurs when large language models (LLMs) exhibit behavior that deviates from the boundaries of their knowledge during the response generation process. Previous learning-based methods focus on detecting knowledge boundaries and finetuning models with instance-level feedback, but they suffer from inaccurate signals due to off-policy data sampling and coarse-grained feedback. In this paper, we introduce Reinforcement Learning for Hallucination (RLFH), a fine-grained, feedback-based online reinforcement learning method for hallucination mitigation. Unlike previous learning-based methods, RLFH enables LLMs to explore the boundaries of their internal knowledge and provides on-policy, fine-grained feedback on these explorations. To construct fine-grained feedback for learning reliable generation behavior, RLFH decomposes the outcomes of large models into atomic facts, provides statement-level evaluation signals, and traces the signals back to the tokens of the original responses. Finally, RLFH adopts an online reinforcement learning algorithm with these token-level rewards to adjust model behavior for hallucination mitigation. For effective on-policy optimization, RLFH also introduces an LLM-based fact assessment framework to verify the truthfulness and helpfulness of atomic facts without human intervention. Experiments on the HotpotQA, SQuADv2, and Biography benchmarks demonstrate that RLFH can balance models’ usage of internal knowledge during the generation process to eliminate the hallucination behavior of LLMs.

878Effective and Efficient Time-Varying Counterfactual Prediction with State-Space Models

[openreview] [pdf]

Abstract Time-varying counterfactual prediction (TCP) from observational data supports the answer of when and how to assign multiple sequential treatments, yielding importance in various applications. Despite the progress achieved by recent advances, e.g., LSTM- or Transformer-based causal approaches, their capability of capturing interactions in long sequences remains to be improved in both prediction performance and running efficiency. In parallel with the development of TCP, state-space models (SSMs) have achieved remarkable progress in long-sequence modeling with reduced running time. Consequently, studying how Mamba, a representative SSM, can simultaneously benefit the effectiveness and efficiency of TCP becomes a compelling research direction. In this paper, we propose to exploit the advantages of SSMs to tackle the TCP task, by introducing a counterfactual Mamba model with Covariate-based Decorrelation towards Selective Parameters (Mamba-CDSP). Motivated by the over-balancing problem that direct covariate balancing methods exhibit in TCP, we propose to de-correlate the current treatment from the representation of historical covariates, treatments, and outcomes, which can mitigate the confounding bias while preserving more covariate information. In addition, we show that the overall de-correlation in TCP is equivalent to regularizing the selective parameters of Mamba over each time step, which makes our approach effective and lightweight. We conducted extensive experiments on both synthetic and real-world datasets, demonstrating that Mamba-CDSP not only outperforms baselines by a large margin, but also exhibits prominent running efficiency.

879Incorporating Human Preferences into Interpretable Reinforcement Learning with Tree Policies

[openreview] [pdf]

Abstract Interpretable reinforcement learning (RL) seeks to create agents that are efficient, transparent, and understandable to the populations that they impact. A significant gap in current approaches is the underutilization of human feedback, which is typically employed only for post-hoc evaluation. We propose to center the needs of end users by incorporating the feedback that would be obtained in a user study directly into the training of interpretable RL algorithms. Our approach involves preference learning, where we learn preferences over high-level features that are not directly optimizable during the RL training process. We introduce an evolutionary algorithm that leverages user feedback to guide training toward interpretable decision-tree policies that are better aligned with human preferences. We demonstrate the effectiveness of our method through experiments using synthetic preference data. Our results show an improvement in preference alignment compared to baselines, yielding policies that are more aligned with underlying user preferences while remaining sample-efficient in the number of user queries, thereby decreasing the burden on users to provide such data.

880Reward Learning from Multiple Feedback Types

[openreview] [pdf]

Abstract Learning rewards from preference feedback has become an important tool in the alignment of agentic models. Preference-based feedback, often implemented as a binary comparison between multiple completions, is an established method to acquire large-scale human feedback. However, human feedback in other contexts is often much more diverse. Such diverse feedback can better support the goals of a human annotator, and the simultaneous use of multiple sources might be mutually informative for the learning process or carry type-dependent biases for the reward learning process. Despite these potential benefits, learning from different feedback types has yet to be explored extensively. In this paper, we bridge this gap by enabling experimentation and evaluating multi-type feedback in a wide set of environments. We present a process to generate high-quality simulated feedback of six different types. Then, we implement reward models and downstream RL training for all six feedback types. Based on the simulated feedback, we investigate the use of types of feedback across five RL environments and compare them to pure preference-based baselines. We show empirically that diverse types of feedback can be utilized simultaneously and lead to improved reward modeling performance. This work is the first strong indicator of the potential of true multi-type feedback for RLHF.

881Manifold Learning via Foliations, and Knowledge Transfer

[openreview] [pdf]

Abstract Understanding how real data is distributed in high-dimensional spaces is key to many tasks in machine learning. We want to provide a natural geometric structure on the space of data, employing a deep ReLU neural network trained as a classifier. Through the data information matrix (DIM), a variation of the Fisher information matrix, the model will discern a singular foliation structure on the space of data. We show that the singular points of such a foliation are contained in a measure-zero set, and that a local regular foliation exists almost everywhere. Experiments show that the data is correlated with the leaves of such a foliation. Moreover, we show the potential of our approach for knowledge transfer by analyzing the spectrum of the DIM to measure distances between datasets.

882Towards Generalization under Topological Shifts: A Diffusion PDE Perspective

[openreview] [pdf]

Abstract The capability of generalization is a cornerstone for the success of modern learning systems. For non-Euclidean data that particularly involves topological features, one important aspect neglected by prior studies is how learning-based models generalize under topological shifts. This paper makes steps towards understanding the generalization of graph neural networks operated on varying topologies through the lens of diffusion PDEs. Our analysis first reveals that the upper bound of the generalization error yielded by local diffusion equation models, which are intimately related to message passing over observed structures, would exponentially grow w.r.t. topological shifts. In contrast, extending the diffusion operator to a non-local counterpart that learns latent structures from data can in principle control the generalization error under topological shifts even when the model accommodates observed structures. On top of these results, we propose Advective Diffusion Transformer inspired by advective diffusion equations serving as a physics-inspired continuous model that synthesizes observed and latent structures for graph learning. The model demonstrates superiority in various downstream tasks across information networks, molecular screening and protein interactions.

883Diversity-Enhanced and Classification-Aware Prompt Learning for Few-Shot Learning via Stable Diffusion

[openreview] [pdf]

Abstract Recent text-to-image generative models have exhibited an impressive ability to generate fairly realistic images from text prompts. In this work, we explore leveraging off-the-shelf text-to-image generative models to train non-specific downstream few-shot classification architectures on synthetic datasets to classify real images. Current approaches use hand-crafted or model-generated text prompts of text-to-image generative models to generate the desired synthetic images; however, they have limited capability of generating diverse images. In particular, their synthetic datasets have relatively limited relevance to the downstream classification tasks, making it hard to guarantee that models trained on synthetic images are effective in practice. To address this issue, we propose a method capable of adaptively learning proper text prompts for the off-the-shelf diffusion model to generate diverse and classification-aware synthetic images. Our approach shows notable improvements on various classification datasets, with results comparable to existing prompt designing methods. We find that replacing the data generation strategy of existing zero/few-shot methods with the proposed method consistently improves downstream classification performance across different network architectures, demonstrating its model-agnostic characteristic for few-shot learning. This makes it possible to train efficient downstream few-shot learning models on synthetic images generated by the proposed method for real problems.

884Learn out of the box: optimizing both diversity and performance in Offline Reinforcement Learning

[openreview] [pdf]

Abstract In offline reinforcement learning, most existing methods have focused primarily on optimizing performance, often neglecting the promotion of diverse behaviors. While some approaches generate diverse behaviors from well-constructed, heterogeneous datasets, their effectiveness is significantly reduced when applied to less diverse data. To address this, we introduce a novel intrinsic reward mechanism that encourages behavioral diversity, irrespective of the dataset’s heterogeneity. By maximizing the mutual information between actions and policies under each state, our approach enables agents to learn a variety of behaviors, including those not explicitly represented in the data. Although performing out-of-distribution actions can lead to risky outcomes, we mitigate this risk by incorporating the ensemble-diversified actor-critic (EDAC) method to estimate Q-value uncertainty, preventing agents from adopting suboptimal behaviors. Through experiments using the D4RL benchmarks on MuJoCo tasks, we demonstrate that our method achieves behavioral diversity while maintaining performance across environments constructed from both heterogeneous and homogeneous datasets.

885OptionZero: Planning with Learned Options

[openreview] [pdf]

Abstract Planning with options -- sequences of primitive actions -- has been shown to be effective in reinforcement learning within complex environments. Previous studies have focused on planning with predefined options or options learned from expert demonstration data. Inspired by MuZero, which learns superhuman heuristics without any human knowledge, we propose a novel approach, named OptionZero. OptionZero incorporates an option network into MuZero, enabling the autonomous discovery of options through self-play games. Furthermore, we modify the dynamics network in MuZero to provide environment transitions when using options, allowing searching deeper under the same simulation constraints. Empirical experiments conducted in 26 Atari games demonstrate that OptionZero outperforms MuZero, achieving a 131.58% improvement in mean human-normalized score. Our behavior analysis shows that OptionZero not only learns options but also acquires strategic skills tailored to different game characteristics. Our findings show promising directions for discovering and using options in planning.

886Delay-Aware Reinforcement Learning: Insights From Delay Distributional Perspective

[openreview] [pdf]

Abstract Although deep reinforcement learning (DRL) has achieved great success across various domains, the presence of random delays in real-world scenarios (e.g., remote control) poses a significant challenge to its practicality. Existing delay-aware DRL methods mainly focus on state augmentation with historical memory, ensuring that the actions taken are aligned with the true state. However, these approaches still rely on the conventional expected Q value. In contrast, to model delay uncertainty, we aim to go beyond the expected value and propose a distributional DRL method to represent the distribution of this Q value. Based on the delay distribution, we further propose a correction mechanism for the distributional Q value, enabling the agent to learn accurate returns in delayed environments. Finally, we apply these techniques to design the delay-aware distributional actor-critic (DADAC) DRL framework, in which the critic is the corrected distributional value function. Experimental results demonstrate that, compared to the state-of-the-art delay-aware DRL methods, the proposed DADAC exhibits substantial performance advantages in handling random delays in the MuJoCo continuous control tasks. The corresponding source code is available at https://anonymous.4open.science/r/DADAC.

887BOND: Aligning LLMs with Best-of-N Distillation

[openreview] [pdf]

Abstract Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in state-of-the-art large language models. Yet, a surprisingly simple and strong inference-time strategy is Best-of-N sampling that selects the best generation among N candidates. In this paper, we propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution. We use the Jeffreys divergence (a linear combination of forward and backward KL) to balance between mode-covering and mode-seeking behavior, and derive an iterative formulation that utilizes a moving anchor for efficiency. We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models.
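
A minimal sketch of the Jeffreys divergence used to balance mode-covering (forward KL) and mode-seeking (backward KL) behavior, computed on discrete distributions; the toy numbers are illustrative, not taken from the paper:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) on discrete distributions, with smoothing for stability."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def jeffreys(p, q, beta=0.5):
    """Linear combination of forward and backward KL; beta trades off
    mode-covering against mode-seeking behavior."""
    return (1 - beta) * kl(p, q) + beta * kl(q, p)

# Toy: a sharper Best-of-N distribution p vs. the current policy q.
p = np.array([0.70, 0.20, 0.10])   # Best-of-N concentrates on high-reward outputs
q = np.array([0.40, 0.35, 0.25])   # current policy distribution
print(jeffreys(p, q, beta=0.5))
```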

888Task Facet Learning: A Structured Approach to Prompt Optimization

[openreview] [pdf]

Abstract Given a task in the form of a basic description and its training examples, prompt optimization is the problem of synthesizing the given information into a text prompt for a large language model. Humans solve this problem by also considering the different facets that define a task (e.g., counter-examples, explanations, analogies) and including them in the prompt. However, it is unclear whether existing algorithmic approaches, based on iteratively editing a given prompt or automatically selecting a few in-context examples, can cover the multiple facets required to solve a complex task. In this work, we view prompt optimization as the problem of learning multiple facets of a task from a set of training examples. We exploit structure in the prompt optimization problem and break down a prompt into loosely coupled semantic sections. The proposed algorithm, UniPrompt, (1) clusters the input space and uses clustered batches so that each batch likely corresponds to a different facet of the task, and (2) utilizes a feedback mechanism to propose adding, editing or deleting a section, which in turn is aggregated over a batch to capture generalizable facets. Empirical evaluation on multiple datasets and a real-world task shows that prompts generated using UniPrompt obtain higher accuracy than human-tuned prompts and those from state-of-the-art methods. In particular, our algorithm can generate long, complex prompts that existing methods are unable to generate.

889Elucidating the Design Choice of Probability Paths in Flow Matching for Forecasting

[openreview] [pdf]

Abstract Flow matching has recently emerged as a powerful paradigm for generative modeling, and has been extended to probabilistic time series forecasting in latent spaces. However, the impact of the specific choice of probability path model on forecasting performance remains under-explored. In this work, we demonstrate that forecasting spatio-temporal data with flow matching is highly sensitive to the selection of the probability path model. Motivated by this insight, we propose a novel probability path model designed to improve forecasting performance. Our empirical results across various dynamical system benchmarks show that our model achieves faster convergence during training and improved predictive performance compared to existing probability path models. Importantly, our approach is efficient during inference, requiring only a few sampling steps. This makes our proposed model practical for real-world applications and opens new avenues for probabilistic forecasting.
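
For orientation, a minimal conditional flow matching loss with a simple linear probability path is sketched below; the paper's point is precisely that this path choice matters, and its proposed path model differs from the linear one assumed here.

```python
import torch

def flow_matching_loss(model, x0, x1):
    """x0: noise samples, x1: data samples, model(x_t, t) -> predicted velocity."""
    # one random time per sample, broadcastable over remaining dims
    t = torch.rand(x1.size(0), *([1] * (x1.dim() - 1)), device=x1.device)
    x_t = (1 - t) * x0 + t * x1   # linear (rectified-flow style) probability path
    v_target = x1 - x0            # target velocity induced by this path
    return ((model(x_t, t) - v_target) ** 2).mean()
```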

890Toward Principled Transformers for Knowledge Tracing

[openreview] [pdf]

Abstract Knowledge tracing aims to reason about changes in students’ knowledge and to predict students’ performance in educational learning settings. We propose knowledge tracing set transformers (KTSTs), a straightforward model class for knowledge tracing prediction tasks. This model class is conceptually simpler than previous state-of-the-art approaches, which are overly complex due to domain-inspired components, and which are in part based on suboptimal design choices and flawed evaluation. In contrast, for KTSTs we propose principled set representations of student interactions and a simplified variant of learnable modification of attention matrices for positional information in a student’s learning history. While being largely domain-agnostic, the proposed model class thus accounts for characteristic traits of knowledge tracing tasks. In extensive empirical experiments on standardized benchmark datasets, KTSTs establish new state-of-the-art performance.

891Bayesian Learning of Adaptive Koopman Operator with Application to Robust Motion Planning for Autonomous Trucks

[openreview] [pdf]

Abstract Koopman theory has recently been shown to enable an efficient data-driven approach for modeling physical systems, offering a linear framework despite underlying nonlinear dynamics. It is, however, not clear how to account for uncertainty or temporal distributional shifts within this framework, both commonly encountered in real-world autonomous driving with changing weather conditions and time-varying vehicle dynamics. In this work, we introduce Bayesian learning of an adaptive Koopman operator to address these limitations. Specifically, we propose a Bayesian Koopman operator that incorporates uncertainty quantification, enabling more robust predictions. To tackle distributional shifts, we propose an online adaptation mechanism, ensuring the operator remains responsive to changes in system dynamics. Additionally, we apply the architecture to motion planning and show that it gives fast and precise predictions. By leveraging uncertainty awareness and real-time updates, our planner generates dynamically accurate trajectories and makes more informed decisions. We evaluate our method on real-world truck dynamics data under varying weather conditions—such as wet roads, snow, and ice—where uncertainty and dynamic shifts are prominent, as well as in other simulated environments. The results demonstrate our method’s ability to deliver accurate, uncertainty-aware open-loop predictions for dynamic systems.

892OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving

[openreview] [pdf]

Abstract Understanding the evolution of 3D scenes is important for effective autonomous driving. While conventional methods model the scene development with the motion of individual instances, world models emerge as a generative framework to describe the general scene dynamics. However, most existing methods adopt an autoregressive framework to perform next-token prediction, which suffers from inefficiency in modeling long-term temporal evolutions. To address this, we propose a diffusion-based 4D occupancy generation model, OccSora, to simulate the development of the 3D world for autonomous driving. We employ a 4D scene tokenizer to obtain compact discrete spatial-temporal representations for 4D occupancy input and achieve high-quality reconstruction for long-sequence occupancy videos. We then learn a diffusion transformer on the spatial-temporal representations and generate 4D occupancy conditioned on a trajectory prompt. We conduct extensive experiments on the widely used nuScenes dataset with Occ3D occupancy annotations. OccSora can generate 16s videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes. With trajectory-aware 4D generation, OccSora has the potential to serve as a world simulator for the decision-making of autonomous driving.

893Data-adaptive Differentially Private Prompt Synthesis for In-Context Learning

[openreview] [pdf]

Abstract Large Language Models (LLMs) rely on the contextual information embedded in examples/demonstrations to perform in-context learning (ICL). To mitigate the risk of LLMs potentially leaking private information contained in examples in the prompt, we introduce a novel data-adaptive differentially private algorithm called AdaDPSyn to generate synthetic examples from the private dataset and then use these synthetic examples to perform ICL. The objective of AdaDPSyn is to adaptively adjust the noise level in the data synthesis mechanism according to the inherent statistical properties of the data, thereby preserving high ICL accuracy while maintaining formal differential privacy guarantees. A key innovation in AdaDPSyn is the Precision-Focused Iterative Radius Reduction technique, which dynamically refines the aggregation radius - the scope of data grouping for noise addition - based on patterns observed in data clustering, thereby minimizing the amount of additive noise. We conduct extensive experiments on standard benchmarks and compare AdaDPSyn with the DP few-shot generation algorithm (Tang et al., 2023). The experiments demonstrate that AdaDPSyn not only outperforms DP few-shot generation, but also maintains high accuracy levels close to those of non-private baselines, providing an effective solution for ICL with privacy protection.
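
A hedged sketch of the radius-reduction idea: clip points to a shrinking L2 radius around a running center and add Gaussian noise scaled to that radius. This is not the paper's exact algorithm; the initial radius bound, shrink factor, and per-round noise calibration (which would require proper privacy composition accounting) are all assumptions.

```python
import numpy as np

def noisy_center_with_radius_reduction(embs, radius, sigma,
                                       n_rounds=3, shrink=0.5):
    """embs: (n, d) private embeddings; radius: a-priori L2 norm bound."""
    center = np.zeros(embs.shape[1])
    for _ in range(n_rounds):
        diffs = embs - center
        norms = np.linalg.norm(diffs, axis=1, keepdims=True)
        # L2-clip each point to the current radius around the center
        clipped = center + diffs * np.minimum(1.0, radius / (norms + 1e-12))
        # mean of L2-clipped points has sensitivity O(radius / n),
        # so the Gaussian noise shrinks together with the radius
        center = clipped.mean(axis=0) + np.random.normal(
            0.0, sigma * radius / len(embs), size=center.shape)
        radius *= shrink  # refine the aggregation radius
    return center
```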

894Advantage Alignment Algorithms

[openreview] [pdf]

Abstract Artificially intelligent agents are increasingly being integrated into human decision-making: from large language model (LLM) assistants to autonomous vehicles. These systems often optimize their individual objective, leading to conflicts, particularly in general-sum games where naive reinforcement learning agents empirically converge to Pareto-suboptimal Nash equilibria. To address this issue, opponent shaping has emerged as a paradigm for finding socially beneficial equilibria in general-sum games. In this work, we introduce Advantage Alignment, a family of algorithms derived from first principles that perform opponent shaping efficiently and intuitively. We achieve this by aligning the advantages of interacting agents, increasing the probability of mutually beneficial actions when their interaction has been positive. We prove that existing opponent shaping methods implicitly perform Advantage Alignment. Compared to these methods, Advantage Alignment simplifies the mathematical formulation of opponent shaping, reduces the computational burden and extends to continuous action domains. We demonstrate the effectiveness of our algorithms across a range of social dilemmas, achieving state-of-the-art cooperation and robustness against exploitation.

895Adversarial Guided Diffusion Models for Adversarial Purification

[openreview] [pdf]

Abstract Diffusion model (DM) based adversarial purification (AP) has proven to be a powerful defense method that can remove adversarial perturbations and generate a purified example without threats. In principle, the pre-trained DMs can only ensure that purified examples conform to the same distribution of the training data, but it may inadvertently compromise the semantic information of input examples, leading to misclassification of purified examples. Recent advancements introduce guided diffusion techniques to preserve semantic information while removing the perturbations. However, these guidances often rely on distance measures between purified examples and diffused examples, which can also preserve perturbations in purified examples. To further unleash the robustness power of DM-based AP, we propose an adversarial guided diffusion model (AGDM) by introducing a novel adversarial guidance that contains sufficient semantic information but does not explicitly involve adversarial perturbations. The guidance is modeled by an auxiliary neural network obtained with adversarial training, considering the distance in the latent representations rather than at the pixel-level values. Extensive experiments are conducted on CIFAR-10, CIFAR-100 and ImageNet to demonstrate that our method is effective for simultaneously maintaining semantic information and removing the adversarial perturbations. In addition, comprehensive comparisons show that our method significantly enhances the robustness of existing DM-based AP, with an average robust accuracy improved by up to 7.30% on CIFAR-10. The code will be available upon acceptance.

896A General Aggregation Federated Learning Intervention Algorithm based on do-Calculus

[openreview] [pdf]

Abstract This article explores federated long-tail learning (Fed-LT) tasks, which involve clients with private and heterogeneous data that exhibit a long-tail distribution. We propose two methods: (a) Client Re-weighted Prior Analyzer (CRePA), which balances the global model’s performance on tail and non-tail categories, enhancing performance on tail categories while maintaining it on non-tail categories; (b) Federated Long-Tail Causal Intervention Model (FedLT-CI), which computes clients’ causal effects on the global model’s tail performance and enhances the interpretability of Fed-LT. Extensive experiments indicate that CRePA achieves state-of-the-art performance compared to other baselines on CIFAR-10-LT and CIFAR-100-LT, and that applying FedLT-CI to all baselines significantly improves tail performance without affecting non-tail performance.

897Multi Task Inverse Reinforcement Learning for Common Sense Reward

[openreview] [pdf]

Abstract One of the challenges in applying reinforcement learning in a complex real-world environment lies in providing the agent with a sufficiently detailed reward function. Any misalignment between the reward and the desired behavior can result in unwanted outcomes. This may lead to issues like “reward hacking” where the agent maximizes rewards through unintended behavior. In this work, we propose to disentangle the reward into two distinct parts: a simple task-specific reward, outlining the particulars of the task at hand, and an unknown common-sense reward, indicating the expected behavior of the agent within the environment. We then explore how this common-sense reward can be learned from expert demonstrations. We first show that inverse reinforcement learning, even when it succeeds in training an agent, does not learn a useful reward function. That is, training a new agent with the learned reward fails to reproduce the desired behaviors. We then demonstrate that this problem can be solved by training simultaneously on multiple tasks. That is, multi-task inverse reinforcement learning can learn a useful reward function.

898WardropNet: Traffic Flow Predictions via Equilibrium-Augmented Learning

[openreview] [pdf]

Abstract When optimizing transportation systems, anticipating traffic flows is a central element. Yet, computing such traffic equilibria remains computationally expensive. Against this background, we introduce a novel combinatorial optimization augmented neural network architecture that allows for fast and accurate traffic flow predictions. We propose WardropNet, a neural network that combines classical layers with a subsequent equilibrium layer: the classical layers inform the equilibrium layer by predicting the parameterization of the equilibrium problem’s latency functions. Using supervised learning, we minimize the difference between the actual traffic flow and the predicted output. We show how to leverage a Bregman divergence fitting the geometry of the equilibria, which allows for end-to-end learning. WardropNet outperforms pure learning-based approaches in predicting traffic equilibria for realistic and stylized traffic scenarios. On realistic scenarios, WardropNet improves on average for time-invariant predictions by up to 72% and for time-variant predictions by up to 23% over pure learning-based approaches.

899Point Cloud Dataset Distillation

[openreview] [pdf]

Abstract This study introduces dataset distillation (DD) tailored for 3D data, particularly point clouds. DD aims to substitute large-scale real datasets with a small set of synthetic samples while preserving model performance. Existing methods mainly focus on structured data such as images. However, adapting DD for unstructured point clouds poses challenges due to their diverse orientations and resolutions in 3D space. To address these challenges, we theoretically demonstrate the importance of matching rotation-invariant features between real and synthetic data for 3D distillation. We further propose a plug-and-play point cloud rotator to align the point cloud to a canonical orientation, facilitating the learning of rotation-invariant features by all point cloud models. Furthermore, instead of optimizing fixed-size synthetic data directly, we devise a point-wise generator to produce point clouds at various resolutions based on the sampled noise amount. Compared to conventional DD methods, the proposed approach, termed DD3D, enables efficient training on low-resolution point clouds while generating high-resolution data for evaluation, thereby significantly reducing memory requirements and enhancing model scalability. Extensive experiments validate the effectiveness of DD3D in shape classification and part segmentation tasks across diverse scenarios, such as cross-architecture and cross-resolution settings.

900Dominant Shuffle: An Incredibly Simple but Exceptionally Effective Data Augmentation Method for Time-Series Prediction

[openreview] [pdf]

Abstract Frequency-domain data augmentation (DA) has shown strong performance in time-series prediction due to its ability to preserve data-label consistency. However, we observed that existing frequency-domain augmentations introduce excessive variability, leading to out-of-distribution samples that may be harmful to model performance. To address this, we introduce two simple modifications to frequency-domain DA. First, we limit perturbations to dominant frequencies with larger magnitudes, which capture the main periodicities and trends of the signal. Second, instead of using complicated random perturbations, we simply shuffle the dominant frequency components, which preserves the original structure while avoiding external noise. With the two simple modifications, we propose dominant shuffle, a simple yet highly effective data augmentation technique for time-series prediction. Our method is remarkably simple, requiring only a few lines of code, yet exceptionally effective, consistently and significantly improving model performance. Extensive experiments on short-term, long-term, few-shot and cold-start prediction tasks with eight state-of-the-art models, nine existing augmentation methods and twelve datasets demonstrate that dominant shuffle consistently boosts model performance with substantial gains, outperforming existing augmentation techniques.
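
The abstract describes the augmentation concretely enough to sketch: pick the K largest-magnitude frequency components and permute them, leaving the rest of the spectrum untouched. K and the choice to pin the DC component are assumptions.

```python
import torch

def dominant_shuffle(x, k=4):
    """x: (batch, seq_len) real-valued time series."""
    spec = torch.fft.rfft(x, dim=-1)
    mag = spec.abs()
    mag[..., 0] = 0.0                       # keep the DC component fixed
    topk = mag.topk(k, dim=-1).indices      # dominant frequencies per series
    # random permutation of the k dominant indices, independently per row
    perm = topk.gather(-1, torch.argsort(
        torch.rand_like(topk, dtype=torch.float), dim=-1))
    shuffled = spec.clone()
    shuffled.scatter_(-1, topk, spec.gather(-1, perm))
    return torch.fft.irfft(shuffled, n=x.size(-1), dim=-1)
```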

901Enhanced Diffusion Sampling via Extrapolation with Multiple ODE Solutions

[openreview] [pdf]

Abstract Diffusion probabilistic models (DPMs), while effective in generating high-quality samples, often suffer from high computational costs due to the iterative sampling process. To address this, we propose an enhanced ODE-based sampling method for DPMs inspired by Richardson extrapolation, which has been shown to reduce numerical error and improve convergence rates. Our method, termed RX-DPM, utilizes numerical solutions obtained over multiple denoising steps, leveraging the multiple ODE solutions to extrapolate the denoised prediction in DPMs. This significantly enhances the accuracy of estimations for the final sample while preserving the number of function evaluations (NFEs). Unlike standard Richardson extrapolation, which assumes uniform discretization of the time grid, we have developed a more general formulation tailored to arbitrary time step scheduling, guided by the local truncation error derived from a baseline sampling method. The simplicity of our approach facilitates accurate estimation of numerical solutions without additional computational overhead, and allows for seamless and convenient integration into various DPMs and solvers. Additionally, RX-DPM provides explicit error estimates, effectively illustrating the faster convergence achieved as the order of the leading error term increases. Through a series of experiments, we demonstrate that the proposed method effectively enhances the quality of generated samples without requiring additional sampling iterations.
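
As a reference point, classical Richardson extrapolation over one interval looks as follows, assuming a solver step of known local order p; RX-DPM generalizes this idea to the non-uniform time grids used by diffusion samplers.

```python
def richardson_step(step_fn, x, t, h, p=1):
    """step_fn(x, t, h): one solver step of size h starting at (x, t)."""
    coarse = step_fn(x, t, h)                     # one step of size h
    half = step_fn(x, t, h / 2)                   # two steps of size h/2
    fine = step_fn(half, t + h / 2, h / 2)
    # combine the two estimates to cancel the leading O(h^p) error term
    return fine + (fine - coarse) / (2 ** p - 1)
```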

902Double Check My Desired Return: Transformer with Value Validation for Offline RL

[openreview] [pdf]

Abstract Recently, there has been increasing interest in applying Transformers to offline reinforcement learning (RL). Existing methods typically frame offline RL as a sequence modeling problem and learn actions via Supervised learning (RvS). However, RvS-trained Transformers struggle to align actual returns with desired target returns, especially when dealing with underrepresented returns in the dataset (interpolation) or missed higher returns that could be achieved by stitching sub-optimal trajectories (extrapolation). In this work, we propose a novel method that Double Checks the Transformer with value validation for Offline RL (Doctor). Doctor integrates the strengths of supervised learning (SL) and temporal difference (TD) learning by jointly optimizing the action prediction and value function. SL stabilizes the prediction of actions conditioned on target returns, while TD learning adds stitching capability to the Transformer. During inference, we introduce a double-check mechanism. We sample actions around desired target returns and validate them with value functions. This mechanism ensures better alignment between the predicted action and the desired target return and is beneficial for further online exploration and fine-tuning. We evaluate Doctor on the D4RL benchmark in both offline and offline-to-online settings, demonstrating that Doctor performs much better in return alignment, both within and beyond the dataset. Furthermore, Doctor performs on par with or outperforms existing RvS-based and TD-based offline RL methods on the final performance.
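
A hedged sketch of the double-check mechanism: sample candidate actions conditioned on the desired target return and keep the one whose learned value best matches that target. The `policy.sample` interface and the Q-ensemble averaging are assumptions, not the paper's exact design.

```python
import torch

@torch.no_grad()
def double_check_action(policy, q_ensemble, state, target_return, n=16):
    """state: (1, obs_dim); returns the action whose Q best matches the target."""
    actions = policy.sample(state, target_return, n)       # (n, act_dim), assumed API
    q_vals = torch.stack(
        [q(state.expand(n, -1), actions) for q in q_ensemble]).mean(0)
    best = (q_vals - target_return).abs().argmin()          # validate vs. target
    return actions[best]
```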

903Decentralized Sporadic Federated Learning: A Unified Algorithmic Framework with Convergence Guarantees

[openreview] [pdf]

Abstract Decentralized federated learning (DFL) captures FL settings where both (i) model updates and (ii) model aggregations are exclusively carried out by the clients without a central server. Existing DFL works have mostly focused on settings where clients conduct a fixed number of local updates between local model exchanges, overlooking heterogeneity and dynamics in communication and computation capabilities. In this work, we propose Decentralized Sporadic Federated Learning (DSpodFL), a DFL methodology built on a generalized notion of sporadicity in both local gradient and aggregation processes. DSpodFL subsumes many existing decentralized optimization methods under a unified algorithmic framework by modeling the per-iteration (i) occurrence of gradient descent at each client and (ii) exchange of models between client pairs as arbitrary indicator random variables, thus capturing heterogeneous and time-varying computation/communication scenarios. We analytically characterize the convergence behavior of DSpodFL for both convex and non-convex models and for both constant and diminishing learning rates, under mild assumptions on the communication graph connectivity, data heterogeneity across clients, and gradient noises. We show how our bounds recover existing results from decentralized gradient descent as special cases. Experiments demonstrate that DSpodFL consistently achieves improved training speeds compared with baselines under various system settings.

904Discrete Bregman Divergence

[openreview] [pdf]

Abstract The Bregman divergence, which is generated from a convex function, is commonly used as a pseudo-distance for comparing vectors or functions in continuous spaces. In contrast, defining an analog of the Bregman divergence for discrete spaces is nontrivial. Iyer and Bilmes (2012a) considered Bregman divergences on discrete domains using submodular functions as generating functions, the discrete analogs of convex functions. In this paper, we further generalize this framework to cases where the generating function is neither submodular nor supermodular, thus increasing the flexibility and representational capacity of the resulting divergence, which we term the discrete Bregman divergence. Additionally, we introduce a learnable form of this divergence using permutation-invariant neural networks (NNs) and demonstrate through experiments that it effectively captures key structural properties in discrete data, outperforming existing methods on tasks such as clustering. This work addresses the challenge of defining meaningful divergences in discrete settings and provides a new tool for tasks requiring structure-preserving distance measures.

905Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency

[openreview] [pdf]

Abstract We investigate the statistical and computational limits of prompt tuning for transformer-based foundation models. Our key contributions are to show that prompt tuning on single-head transformers with only a single self-attention layer (i) is universal, and (ii) supports efficient (even nearly-linear time) algorithms under the Strong Exponential Time Hypothesis (SETH). Statistically, we prove that prompt tuning on such simplest possible transformers is a universal approximator for sequence-to-sequence Lipschitz functions. In addition, we provide an exponential-in-$dL$ and -in-$(1/\epsilon)$ lower bound on the number of soft-prompt tokens required for prompt tuning to memorize any dataset with 1-layer, 1-head transformers. Computationally, we identify a phase transition in the efficiency of prompt tuning, determined by the norm of the soft-prompt-induced keys and queries, and provide an upper bound criterion. Beyond this criterion, no sub-quadratic (efficient) algorithm for prompt tuning exists under SETH. Within this criterion, we showcase our theory by proving the existence of almost-linear time prompt tuning inference algorithms. These fundamental limits provide important necessary conditions for designing expressive and efficient prompt tuning methods for practitioners.

906Advantage-Guided Distillation for Preference Alignment in Small Language Models

[openreview] [pdf]

Abstract Alignment techniques such as RLHF enable LLMs to generate outputs that align with human preferences and play an essential role in their effectiveness. However, their impact often diminishes when applied to smaller language models, likely due to the limited capacity of these models. Instead of directly applying existing alignment techniques to smaller models, we propose to utilize a well-aligned teacher LLM to guide the alignment process for these models, thereby facilitating the transfer of the teacher’s knowledge of human preferences to the student model. To achieve this, we first explore a straightforward approach, Dual-Constrained Knowledge Distillation (DCKD), that employs knowledge distillation with two KL-divergence constraints from the aligned teacher to the unaligned student. To further enhance the contrastive effect, we then propose Advantage-Guided Distillation for Preference Alignment (ADPA), which leverages an advantage function from the aligned teacher to deliver more nuanced, distribution-level reward signals for the student’s alignment. Our experimental results demonstrate that these two approaches appreciably improve the alignment of smaller language models and narrow the performance gap with their larger counterparts.
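
A minimal sketch of a dual-constrained distillation objective in the spirit of DCKD, assuming per-response logits of shape (batch, vocab) for chosen (w) and rejected (l) completions; the loss weighting and the form of the preference loss are illustrative assumptions.

```python
import torch.nn.functional as F

def dckd_loss(student_logits_w, student_logits_l,
              teacher_logits_w, teacher_logits_l,
              preference_loss, alpha=1.0):
    """Preference loss plus two KL constraints pulling the student
    toward the aligned teacher on chosen and rejected responses."""
    kl_w = F.kl_div(F.log_softmax(student_logits_w, -1),
                    F.log_softmax(teacher_logits_w, -1),
                    log_target=True, reduction="batchmean")
    kl_l = F.kl_div(F.log_softmax(student_logits_l, -1),
                    F.log_softmax(teacher_logits_l, -1),
                    log_target=True, reduction="batchmean")
    return preference_loss + alpha * (kl_w + kl_l)
```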

907Exploring the Causal Mechanisms: Towards Robust and Explainable Algorithm Selection

[openreview] [pdf]

Abstract Algorithm selection aims to identify the optimal performing algorithm before execution. Existing techniques typically focus on the observed correlations between algorithm performance and meta-features. However, little research has explored the underlying mechanisms of algorithm selection, specifically what characteristics an algorithm must possess to effectively tackle problems with certain feature values. This gap not only limits the explainability but also makes existing models vulnerable to data bias and distribution shift. This paper introduces causality to describe this mechanism, proposing a novel modeling paradigm that aligns more closely with the fundamental logic of algorithm selection. By leveraging causal relationships to characterize the algorithm feature distribution conditioned on problem features, our approach enhances robustness against marginal distribution changes and allows for finer-grained predictions through the reconstruction of optimal algorithm features, with the final decision relying on differences between reconstructed and rejected algorithm features. Furthermore, we demonstrate that the learned causal graph and the proposed counterfactual calculations endow our approach with both model-level and instance-level explainability. Extensive experiments on the ASlib benchmark validate the advantages of the proposed model in terms of robustness and explainability. The code will be made publicly available after the review process.

908Unlocking Video-LLM via Agent-of-Thoughts Distillation

[openreview] [pdf]

Abstract This paper tackles the problem of video question answering (VideoQA), a task that often requires multi-step reasoning and a profound understanding of spatial-temporal dynamics. While large generative video-language models perform well on benchmarks, they often lack explainability and spatial-temporal grounding. In this paper, we propose Agent-of-Thoughts Distillation (AoTD), a method that enhances generative models by incorporating automatically generated Chain-of-Thoughts (CoTs) into the instruction-tuning process. Specifically, we leverage an agent-based system to decompose complex questions into sub-tasks and address them with specialized vision models; the intermediate results are then treated as reasoning chains. We also introduce a verification mechanism using a large language model (LLM) to ensure the reliability of generated CoTs. Extensive experiments demonstrate that AoTD improves the performance on multiple-choice and open-ended benchmarks.

909R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback

[openreview] [pdf]

Abstract Reinforcement learning from human feedback (RLHF) provides a paradigm for aligning large language models (LLMs) with human preferences. This involves the initial training of a reward model based on pairwise human feedback. The reward model is subsequently utilized in reinforcement learning to assess the scores of each generated sentence as a whole, further guiding the optimization of LLMs. However, current approaches have a significant shortcoming: They allocate a single, sparse, and delayed reward to an entire sequence of output. This may overlook some significant individual contributions of each token towards the desired outcome. To overcome this limitation, our paper proposes a novel reward redistribution method called R3HF, which facilitates a more fine-grained, token-level reward allocation. Specifically, our method treats the reward prediction task of the reward model as a regression problem. As a result, the redistributed rewards are computed by evaluating the specific contribution of each token to the reward model’s output. This detailed approach improves the model’s understanding of language nuances, leading to more precise enhancements in its performance. Our method is crafted to integrate seamlessly with most current techniques while incurring minimal computational costs. Through comprehensive experiments across diverse datasets and tasks, we have verified the effectiveness and superiority of our approach.
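
One simple instantiation of token-level reward redistribution consistent with the regression view described here is to score every prefix with the reward model and take successive differences as per-token contributions. The reward-model interface below is an assumption, and the per-prefix forward passes are kept naive for clarity.

```python
import torch

@torch.no_grad()
def redistribute_rewards(reward_model, token_ids):
    """token_ids: (seq_len,) one generated response.
    reward_model maps a (1, t) prefix to a scalar score (assumed API)."""
    prefix_scores = torch.stack([
        reward_model(token_ids[: t + 1].unsqueeze(0)).squeeze()
        for t in range(len(token_ids))
    ])
    # reward of token t = marginal change it causes in the predicted score
    return torch.cat([prefix_scores[:1],
                      prefix_scores[1:] - prefix_scores[:-1]])
```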

910Spreading Out-of-Distribution Detection on Graphs

[openreview] [pdf]

Abstract Node-level out-of-distribution (OOD) detection on graphs has received significant attention from the machine learning community. However, previous approaches are evaluated using unrealistic benchmarks that consider only randomly selected OOD nodes, failing to reflect the interactions among nodes. In this paper, we introduce a new challenging task to model the interactions of OOD nodes in a graph, termed spreading OOD detection, where a newly emerged OOD node spreads its property to neighboring nodes. We curate realistic benchmarks by employing the epidemic spreading models that simulate the spreading of OOD nodes on the graph. We also showcase a "Spreading COVID-19" dataset to demonstrate the applicability of spreading OOD detection in real-world scenarios. Furthermore, to effectively detect spreading OOD samples under the proposed benchmark setup, we present a new approach called energy distribution-based detector (EDBD), which includes a novel energy-aggregation scheme. EDBD is designed to mitigate undesired mixing of OOD scores between in-distribution (ID) and OOD nodes. Our extensive experimental results demonstrate the superiority of our approach over state-of-the-art methods in both spreading OOD detection and conventional node-level OOD detection tasks across seven benchmark datasets.
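
For intuition, the basic energy-based OOD score and one step of neighborhood aggregation can be sketched as follows; the weighting scheme EDBD uses to prevent mixing of ID and OOD scores is paper-specific and omitted, and the mixing coefficient `alpha` is an assumption.

```python
import torch

def energy_scores(logits):
    """logits: (N, C) classifier outputs; lower energy suggests in-distribution."""
    return -torch.logsumexp(logits, dim=-1)

def aggregate_energy(energy, adj, alpha=0.5):
    """One propagation step. adj: (N, N) row-normalized adjacency; energy: (N,)."""
    return alpha * energy + (1 - alpha) * adj @ energy
```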

911Double Descent Meets Out-of-Distribution Detection: Theoretical Insights and Empirical Analysis of the role of model complexity

[openreview] [pdf]

Abstract While overparameterization is known to benefit generalization, its impact on Out-Of-Distribution (OOD) detection is less understood. This paper investigates the influence of model complexity in OOD detection. We propose an expected OOD risk metric to evaluate classifiers’ confidence on both training and OOD samples. Leveraging Random Matrix Theory, we derive bounds for the expected OOD risk of binary least-squares classifiers applied to Gaussian data. We show that the OOD risk exhibits an infinite peak when the number of parameters equals the number of samples, which we associate with the double descent phenomenon. Our experimental study on different OOD detection methods across multiple neural architectures extends our theoretical insights and highlights a double descent curve. Our observations suggest that overparameterization does not necessarily lead to better OOD detection. Using the Neural Collapse framework, we provide insights to better understand this behavior. To facilitate reproducibility, our code will be made publicly available upon publication.

912Open-World Test-Time Training: Self-Training with Contrastive Learning

[openreview] [pdf]

Abstract Traditional test-time training (TTT) methods, while addressing domain shifts, often assume a consistent class set, which limits their applicability in real-world scenarios with infinite variety. Open-World Test-Time Training (OWTTT) addresses the challenge of generalizing deep learning models to unknown target domain distributions, especially in the presence of strong Out-of-Distribution (OOD) data. Existing TTT methods often struggle to maintain performance when confronted with strong OOD data. In OWTTT, the primary focus has been on distinguishing between strong and weak OOD data. However, during the early stages of TTT, initial feature extraction is hampered by interference from strong OOD and corruptions, leading to reduced contrast and premature classification of certain classes as strong OOD. To handle this problem, we introduce Open World Dynamic Contrastive Learning (OWDCL), an innovative approach that leverages contrastive learning to augment positive sample pairs. This strategy not only enhances contrast in the early stages but also significantly improves model robustness in later stages. Across the comparison datasets, our OWDCL model achieves state-of-the-art performance.

913Breaking through Data Scarcity: Knowledge Transfer in Offline Reinforcement Learning

[openreview] [pdf]

Abstract We focus on knowledge transfer in offline reinforcement learning (RL), which aims to significantly improve the learning of an optimal policy in a target task based on a pre-collected dataset without further interactions with the environment. Data scarcity and high-dimensional feature spaces pose serious challenges to offline RL in many real-world applications, and knowledge transfer offers a promising solution. We propose a novel and comprehensive knowledge transfer framework for offline RL, which carefully considers the relationship between the target and source tasks within the linear Markov decision process (MDP) framework. This enables efficient knowledge transfer from related source tasks to enhance learning in the target task and effectively address data scarcity concerns in offline RL. Our main contributions include establishing a relationship between the learning processes of the target and source tasks, introducing an effective and robust knowledge transfer technique to reduce the suboptimality of the learned policy, and demonstrating the significant effectiveness of the knowledge transfer framework through detailed theoretical analysis. Our work significantly contributes to the advancement of offline RL by providing a practical and robust framework for knowledge transfer, facilitating more efficient and effective data utilization in various applications.

914BAYESIAN EXPERIMENTAL DESIGN VIA CONTRASTIVE DIFFUSIONS

[openreview] [pdf]

Abstract Bayesian Optimal Experimental Design (BOED) is a powerful tool to reduce the cost of running a sequence of experiments. When based on the Expected Information Gain (EIG), design optimization corresponds to the maximization of some intractable expected contrast between prior and posterior distributions. Scaling this maximization to high dimensional and complex settings has been an issue due to BOED inherent computational complexity. In this work, we introduce an expected posterior distribution with cost-effective sampling properties and provide tractable access to the EIG contrast maximization via a new EIG gradient expression. Diffusion-based samplers are used to compute the dynamics of the expected posterior and ideas from bi-level optimization are leveraged to derive an efficient joint sampling-optimization loop, without resorting to lower bound approximations of the EIG. The resulting efficiency gain allows to extend BOED to the well-tested generative capabilities of diffusion models. By incorporating generative models into the BOED framework, we expand its scope and its use in scenarios that were previously impractical. Numerical experiments and comparison with state-of-the-art methods show the potential of the approach.

915Zero-shot Novel View Synthesis via Adaptive Modulating Video Diffusion Process

[openreview] [pdf]

Abstract By harnessing the potent generative capabilities of pre-trained large video diffusion models, we propose a new novel view synthesis paradigm that operates without the need for training. The proposed method adaptively modulates the diffusion sampling process with the given views to enable the creation of visually pleasing results from single or multiple views of static scenes or monocular videos of dynamic scenes. Specifically, built upon our theoretical modeling, we iteratively modulate the score function with the given scene priors represented with warped input views to control the video diffusion process. Moreover, by theoretically exploring the boundary of the estimation error, we achieve the modulation in an adaptive fashion according to the view pose and the number of diffusion steps. Extensive evaluations on both static and dynamic scenes substantiate the significant superiority of our method over state-of-the-art methods both quantitatively and qualitatively. The source code can be found on the anonymous webpage: https://github.com/PAPERID5494/VD_NVS. We also refer reviewers to the Supplementary Material for the video demo.

916FedGO: Federated Ensemble Distillation with GAN-based Optimality

[openreview] [pdf]

Abstract For federated learning in practical settings, a significant challenge is the considerable diversity of data across clients. To tackle this data heterogeneity issue, it has been recognized that federated ensemble distillation is effective. Federated ensemble distillation requires an unlabeled dataset on the server, which could either be an extra dataset the server already possesses or a dataset generated by training a generator through a data-free approach. Then, it proceeds by generating pseudo-labels for the unlabeled data based on the predictions of client models and training the server model using this pseudo-labeled dataset. Consequently, the efficacy of ensemble distillation hinges on the quality of these pseudo-labels, which, in turn, poses a challenge of appropriately assigning weights to client predictions for each data point, particularly in scenarios with data heterogeneity. In this work, we suggest a provably near-optimal weighting method for federated ensemble distillation, inspired by theoretical results in generative adversarial networks (GANs). Our weighting method utilizes client discriminators, trained at the clients based on a generator distributed from the server and their own datasets. Our comprehensive experiments on various image classification tasks illustrate that our method significantly improves the performance over baselines, under various scenarios with and without extra server dataset. Furthermore, we provide an extensive analysis of additional communication cost, privacy leakage, and computational burden caused by our weighting method.

917Greedy Learning to Optimize with Convergence Guarantees

[openreview] [pdf]

Abstract Learning to optimize (L2O) is an approach that leverages training data to accelerate the solution of optimization problems. Many approaches use unrolling to parametrize the update step and learn optimal parameters. Although L2O has shown empirical advantages over classical optimization algorithms, memory restrictions often greatly limit the unroll length, and learned algorithms usually do not provide convergence guarantees. In contrast, we introduce a novel method employing a greedy strategy that learns iteration-specific parameters by minimizing the function value at the next iteration. This enables training over significantly more iterations while maintaining constant memory usage. We parameterize the update such that parameter learning corresponds to solving a convex optimization problem at each iteration. In particular, we explore preconditioned gradient descent with multiple parametrizations, including a novel convolutional preconditioner. With our learned algorithm, convergence on the training set is proven even when the preconditioner is neither symmetric nor positive definite. Convergence on a class of unseen functions is also obtained, ensuring robust performance and generalization beyond the training data. We test our learned algorithms on two inverse problems, image deblurring and Computed Tomography, on which the learned convolutional preconditioner demonstrates improved empirical performance over classical optimization algorithms such as Nesterov’s Accelerated Gradient Method and the quasi-Newton method L-BFGS.
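
A hedged sketch of the greedy strategy with a diagonal preconditioner: at each outer iteration, fit only that iteration's preconditioner by minimizing the function value at the next iterate, then freeze it and move on, keeping memory constant in the number of iterations. The positivity parameterization and the inner Adam optimizer are my simplifications; the paper additionally covers convolutional preconditioners and non-positive-definite cases.

```python
import torch

def learn_greedy_preconditioners(f, x0, n_iters=20, inner_steps=50):
    """f: scalar-valued objective; x0: initial iterate."""
    x, learned = x0.clone().requires_grad_(True), []
    for _ in range(n_iters):
        g = torch.autograd.grad(f(x), x)[0].detach()
        log_p = torch.zeros_like(x, requires_grad=True)  # P = exp(log_p) > 0
        opt = torch.optim.Adam([log_p], lr=0.1)
        for _ in range(inner_steps):                     # fit this iteration only
            opt.zero_grad()
            f(x.detach() - log_p.exp() * g).backward()   # value at next iterate
            opt.step()
        P = log_p.detach().exp()
        learned.append(P)
        x = (x.detach() - P * g).requires_grad_(True)    # take the greedy step
    return learned
```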

918Procedural Fairness Through Addressing Social Determinants of Opportunity

[openreview] [pdf]

Abstract Social determinants of opportunity are variables that, while not directly pertaining to any specific individual, capture key aspects of contexts and environments that have direct causal influences on certain attributes of an individual, e.g., environmental pollution in an area affects an individual’s health condition, and educational resources in a neighborhood influence an individual’s academic preparedness. Previous algorithmic fairness literature often overlooks social determinants of opportunity, leading to implications for procedural fairness and structural justice that are incomplete and potentially even inaccurate. We propose a modeling framework that explicitly incorporates social determinants of opportunity and their causal influences on individual-level attributes of interest. To demonstrate theoretical perspectives and practical applicability of our framework, we consider college admissions as a running example. Specifically, for three mainstream admission procedures that have historically been implemented or are still in use today, we distinguish and draw connections between the outcome of admission decision-making and the underlying distribution of academic preparedness in the applicant population. Our findings suggest that mitigation strategies centering solely around protected features may introduce new procedural unfairness when addressing existing discrimination. Considering both individual-level attributes and social determinants of opportunity facilitates a more comprehensive explication of benefits and burdens experienced by individuals from diverse demographic backgrounds as well as contextual environments, which is essential for understanding and achieving procedural fairness effectively and transparently.

919Topology-aware Graph Diffusion Model with Persistent Homology

[openreview] [pdf]

Abstract Generating realistic graphs presents challenges in estimating an accurate distribution of graphs in an embedding space while preserving structural characteristics such as topology. However, existing graph generation methods primarily focus on approximating the joint distribution of graph nodes and edges, overlooking topology-wise similarity and hindering accurate representation of global graph structures such as connected components and loops. To address this issue, we propose a topology-aware diffusion-based graph generation method that aims to closely resemble the structural characteristics of the original graph by leveraging persistent homology from topological data analysis (TDA). Specifically, we suggest a novel loss function, Persistence Diagram Matching (PDM) loss, which ensures that the generated graphs closely match the topology of the original graphs, enhancing their fidelity and preserving essential homological properties. Also, we introduce a novel topology-aware attention mechanism to enhance the self-attention module in the denoising network. Through comprehensive experiments, we demonstrate the effectiveness of our approach not only by exhibiting high generation performance across various metrics, but also by demonstrating a closer alignment with the distribution of topological features observed in the original graphs. In addition, application to real brain network data showcases its versatility and potential for complex, real-world graph applications.

920METHODS OF IMPROVING LLM TRAINING STABILITY

[openreview] [pdf]

Abstract Training stability of large language models (LLMs) is an important research topic. Reproducing training instabilities can be costly, so we use a small language model with 830M parameters and experiment with higher learning rates to force models to diverge, as in Wortsman et al. (2024). One of the sources of training instability is the growth of logits in attention layers (Dehghani et al., 2023). We extend the focus of the previous work (Dehghani et al., 2023; Wortsman et al., 2024) and look not only at the magnitude of the logits but at all outputs of linear layers in the Transformer block. We observe that with a high learning rate the L2 norm of all linear layer outputs grows with each training step and the model diverges. Specifically, we observe that the QKV, Proj and FC2 layers have the largest growth of the output magnitude. This prompts us to explore several options: 1) apply layer normalization not only after the QK layers (as is done in Dehghani et al. (2023) and Wortsman et al. (2024)) but after the Proj and FC2 layers too; 2) apply layer normalization after the QKV layer (and remove pre-normalization); 3) apply QK layer normalization together with softmax capping. We show that with the last two methods we can increase the learning rate by 1.5x (without model divergence) in comparison to an approach based on QK layer normalization only (Dehghani et al., 2023). We also observe significant perplexity improvements for all three methods in comparison to the baseline model.
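
Two of the listed options are easy to illustrate. The sketch below shows softmax capping of attention logits via a scaled tanh and an extra layer normalization placed after the FC2 output; the cap value and layer sizes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def cap_logits(logits, cap=30.0):
    # bounds attention logits to (-cap, cap), preventing unbounded growth
    return cap * torch.tanh(logits / cap)

class StabilizedMLP(nn.Module):
    """Feed-forward block with layer normalization after the FC2 output."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.post_ln = nn.LayerNorm(d_model)  # extra LN to control output norm

    def forward(self, x):
        return self.post_ln(self.fc2(torch.relu(self.fc1(x))))
```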

921LSTR: Long-Short Range Aggregation for Trajectory Prediction at Intersection Scenarios

[openreview] [pdf]

Abstract Trajectory prediction is crucial for practical applications, encompassing navigation for autonomous vehicles and the implementation of safety systems based on the Internet of Vehicles (IoV). Most existing methods significantly rely on comprehensive map information, employing robust rule constraints to incrementally predict trajectories within the driver’s local decision-making context. However, in environments characterized by weak rule enforcement, such as urban intersections, these approaches neglect the disparity between the driver’s overarching intentions and current behaviors. Recognizing the characteristics of intersection traffic flow, which is macroscopically organized yet microscopically disordered and highly heterogeneous, this paper presents a novel model termed Long-short Range Aggregation for Trajectory Prediction in Intersections (LSTR). This model anchors the vehicle’s local decision-making process to long-range intentions. Specifically, LSTR predicts the vehicle’s destination via a global intention inference module and models its long-range driving intentions through clustering to extract macroscopic traffic flow patterns. This long-range intention subsequently informs the short-range local interaction behaviors captured by the local behavior decision module. Ultimately, the fused features from these two modules are analyzed using a multi-modal decoder to interpret the various motion patterns, resulting in the trajectory prediction outcomes. We rigorously validate the proposed framework across multiple intersection scenarios utilizing real-world datasets, including inD, rounD, and a subset of WOMD. Experimental results demonstrate that our model outperforms numerous benchmarks without relying on additional information such as HD maps of intersections.

922Learning Symmetries through Loss Landscape

[openreview] [pdf]

Abstract Incorporating equivariance as an inductive bias into deep learning architectures to take advantage of data symmetry has been successful in multiple applications, such as chemistry and dynamical systems. Building equivariant architectures, particularly w.r.t. roto-translations, is crucial for effectively modeling geometric graphs and molecules, where the understanding of 3D structures enhances generalization. However, despite their potential, equivariant models often pose challenges due to their high computational complexity. In this paper, we study the capabilities of unconstrained models (which do not build equivariance into the architecture) and how they generalize compared to equivariant models. We show that unconstrained models can learn approximate symmetries by minimizing an additional simple equivariance loss. By formulating equivariance as a new learning objective, we can control the level of approximate equivariance in the model. Our method achieves competitive performance compared to equivariant baselines while being 10x faster at inference and 2.5x faster at training.
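
A minimal version of such an equivariance loss for rotations, assuming a model f that maps (N, 3) coordinates to (N, 3) vector features; the penalty measures how far f is from commuting with a random rotation.

```python
import torch

def random_rotation():
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix
    q, _ = torch.linalg.qr(torch.randn(3, 3))
    return q * torch.sign(torch.det(q))  # ensure det(R) = +1 (proper rotation)

def equivariance_loss(f, x):
    """Penalizes the mismatch between f(R x) and R f(x)."""
    R = random_rotation()
    return ((f(x @ R.T) - f(x) @ R.T) ** 2).mean()
```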

923Do Think Tags Really Help LLMs Plan? A Critical Evaluation of ReAct-Style Prompting

[openreview] [pdf]

Abstract The reasoning abilities of Large Language Models (LLMs) remain a topic of debate and are critically tested in sequential decision-making problems. ReAct, a recently popular method, claims to enhance LLM reasoning abilities by directly prompting them with "interleaving reasoning trace with action execution" in text-based planning domains such as AlfWorld and WebShop. However, given the different components of ReAct-style prompting, it remains unclear what the source of improvement in LLM performance is. In this paper, we critically examine the claims of ReAct-style prompting for sequential decision-making problems. By introducing systematic variations to the input prompt, we perform a sensitivity analysis along the original claims of ReAct. Contrary to these claims and common use-cases that utilize ReAct-style prompting, we find that the performance is minimally influenced by the interleaved reasoning trace or by the content of these generated reasoning traces. Instead, the performance of LLMs is primarily driven by the unreasonably high degree of similarity between input example tasks and queries, implicitly forcing the prompt designer to provide instance-specific examples, which significantly increases the cognitive burden on the human. Our empirical results, on the same suite of domains as ReAct, show that the perceived reasoning abilities of LLMs stem from the exemplar-query similarity and approximate retrieval rather than any inherent reasoning abilities.

924In-Context Transfer Learning: Demonstration Synthesis by Transferring Similar Tasks

[openreview] [pdf]

Abstract In-context learning (ICL) is an effective approach to help large language models (LLMs) adapt to various tasks by providing demonstrations of the target task. Considering the high cost of labeling demonstrations, many methods propose synthesizing demonstrations from scratch using LLMs. However, the quality of the demonstrations synthesized from scratch is limited by the capabilities and knowledge of LLMs. To address this, inspired by transfer learning, we propose In-Context Transfer Learning (ICTL), which synthesizes target task demonstrations by transferring labeled demonstrations from similar source tasks. ICTL consists of two steps: source sampling and target transfer. First, we define an optimization objective, which minimizes transfer error to sample source demonstrations similar to the target task. Then, we employ LLMs to transfer the sampled source demonstrations to match the definition and format of the target task. Experiments on Super-NI show that ICTL outperforms synthesis from scratch by 2.0% on average, demonstrating the effectiveness of our method.

925MixMax: Distributional Robustness in Function Space via Optimal Data Mixtures

[openreview] [pdf]

Abstract Machine learning models are often required to perform well across several pre-defined settings, such as a set of user groups. Worst-case performance is a common metric to capture this requirement, and is the objective of group distributionally robust optimization (group DRO). Unfortunately, these methods struggle when the loss is non-convex in the parameters, or the model class is non-parametric. Here, we make a classical move to address this: we reparameterize group DRO from parameter space to function space, which results in a number of advantages. First, we show that group DRO over the space of bounded functions admits a minimax theorem. Second, for cross-entropy and mean squared error, we show that the minimax optimal mixture distribution is the solution of a simple convex optimization problem. Thus, provided one is working with a model class of universal function approximators, group DRO can be solved by a convex optimization problem followed by a classical risk minimization problem. We call our method MixMax. In our experiments, we found that MixMax matched or outperformed the standard group DRO baselines, and in particular, MixMax improved the performance of XGBoost over the only baseline, data balancing, for variations of the ACSIncome and CelebA annotations datasets.

926Cross-Entropy Is All You Need To Invert the Data Generating Process

[openreview] [pdf]

Abstract Supervised learning has become a cornerstone of modern machine learning, yet a comprehensive theory explaining its effectiveness remains elusive. Empirical phenomena, such as neural analogy-making and the linear representation hypothesis, suggest that supervised models can learn interpretable factors of variation in a linear fashion. Recent advances in self-supervised learning, particularly nonlinear Independent Component Analysis, have shown that these methods can recover latent structures by inverting the data generating process. We extend these identifiability results to parametric instance discrimination, then show how insights transfer to the ubiquitous setting of supervised learning with cross-entropy minimization. We prove that even in standard classification tasks, models learn representations of ground-truth factors of variation up to a linear transformation. We corroborate our theoretical contribution with a series of empirical studies. First, using simulated data matching our theoretical assumptions, we demonstrate successful disentanglement of latent factors. Second, we show that on DisLib, a widely-used disentanglement benchmark, simple classification tasks recover latent structures up to linear transformations. Finally, we reveal that models trained on ImageNet encode representations that permit linear decoding of proxy factors of variation. Together, our theoretical findings and experiments offer a compelling explanation for recent observations of linear representations, such as superposition in neural networks. This work takes a significant step toward a cohesive theory that accounts for the unreasonable effectiveness of supervised deep learning.

927Torque-Aware Momentum

[openreview] [pdf]

Abstract Efficiently exploring complex loss landscapes is key to the performance of deep neural networks. While momentum-based optimizers are widely used in state-of-the-art setups, classical momentum can still struggle with large, misaligned gradients, leading to oscillations. To address this, we propose Torque-Aware Momentum (TAM), which introduces a damping factor based on the angle between the new gradients and previous momentum, stabilizing the update direction during training. Empirical results show that TAM, which can be combined with both SGD and Adam, enhances exploration, handles distribution shifts more effectively, and improves generalization performance across various tasks, including image classification and large language model fine-tuning, when compared to classical momentum-based optimizers.
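
A hedged sketch of a torque-aware update: damp the momentum by a factor computed from the angle between the incoming gradient and the current momentum. The specific damping function below, (1 + cos)/2, is an assumption standing in for the paper's; the same idea composes with Adam as noted in the abstract.

```python
import torch

def tam_step(param, grad, momentum, mu=0.9, lr=1e-2, eps=1e-12):
    """In-place SGD-with-momentum step where misaligned gradients damp momentum."""
    cos = torch.dot(grad.flatten(), momentum.flatten()) / (
        grad.norm() * momentum.norm() + eps)
    damping = (1.0 + cos) / 2.0             # 1 when aligned, 0 when opposed
    momentum.mul_(mu * damping).add_(grad)  # damped momentum accumulation
    param.add_(momentum, alpha=-lr)
    return momentum
```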

928Bandit Learning in Matching Markets with Indifference

[openreview] [pdf]

Abstract A rich line of recent works studies how participants in matching markets learn their unknown preferences through iterative interactions with each other. Two sides of participants in the market can be respectively formulated as players and arms in the bandit problem. To ensure market stability, the objective is to minimize the stable regret of each player. Though existing works provide significant theoretical upper bounds for players’ stable regret, the results heavily rely on the assumption that each participant has a strict preference ranking. However, in real applications, multiple candidates (e.g., workers in the labor market and students in school admission) usually demonstrate comparable performance levels, making it challenging for participants (e.g. employers and schools) to differentiate and rank their preferences. To deal with the potential indifferent preferences, we propose an adaptive exploration algorithm based on arm-guided Gale-Shapley (AE-AGS). We show that its stable regret is of order $O(NK \log T / \Delta^2)$, where $N$ is the number of players, $K$ the number of arms, $T$ the total time horizon, and $\Delta$ the minimum non-zero preference gap. To the best of our knowledge, this is the first polynomial regret bound applicable to the more general indifference setting, and it is only $O(N)$ worse than the state-of-the-art result in the strict preference setting. Extensive experiments demonstrate the algorithm’s effectiveness in handling such complex situations and its consistent superiority over baselines.

929OccVAR: Scalable 4D Occupancy Prediction via Next-Scale Prediction

[openreview] [pdf]

Abstract In this paper, we propose OCCVAR, a generative occupancy world model that simulates the movement of the ego vehicle and the evolution of the surrounding environment. Different from visual generation, an occupancy world model must capture the fine-grained 3D geometry and dynamic evolution of 3D scenes, posing great challenges for generative models. Recent approaches based on autoregression (AR) have demonstrated the potential to predict vehicle movement and future occupancy scenes simultaneously from historical observations, but they typically suffer from inefficiency and temporal degradation in long-horizon generation. To holistically address the efficiency and quality issues, we propose a spatial-temporal transformer via temporal next-scale prediction, aiming at predicting the 4D occupancy scenes from coarse to fine scales. To model the dynamic evolution of the scene, we prepend the ego movement to the tokenized occupancy sequence, enabling the prediction of ego movement and controllable scene generation. To model the fine-grained 3D geometry, OCCVAR utilizes a multi-scale scene tokenizer to capture the hierarchical information of the 3D scene. Experiments show that OCCVAR is capable of high-quality occupancy reconstruction, long-horizon generation, and fast inference compared to prior works.

930DiffusionTrend: A Minimalist Approach to Virtual Fashion Try-On

[openreview] [pdf]

Abstract In this paper, we introduce DiffusionTrend, a pioneering approach for virtual fashion try-on that forgoes the need for training diffusion models, thereby offering simple, conventional pose virtual try-on services with significantly reduced computational overhead. By leveraging advanced diffusion models, DiffusionTrend harnesses latents rich with prior information to capture the nuances of garment details. Throughout the diffusion denoising process, these details are seamlessly integrated into the model image generation, expertly directed by a precise garment mask crafted by a lightweight and compact CNN. Although our DiffusionTrend model initially demonstrates suboptimal metric performance, our exploratory approach offers several significant advantages: (1) It circumvents the need for resource-intensive training of diffusion models on large datasets. (2) It eliminates the necessity for various complex and user-unfriendly model inputs. (3) It delivers a visually compelling virtual try-on experience, underscoring the potential of training-free diffusion models for future research within the community. Overall, this initial foray into the application of untrained diffusion models in virtual try-on technology paves the way for further exploration and refinement in this innovative field.

931Achieving Optimal Breakdown for Byzantine-Robust Gossip

[openreview] [pdf]

Abstract Distributed approaches have many computational benefits, but they are vulnerable to attacks from a subset of devices transmitting incorrect information. This paper investigates Byzantine-resilient algorithms in a decentralized setting, where devices communicate directly with one another. We investigate the notion of breakdown point, and show an upper bound on the number of adversaries that decentralized algorithms can tolerate. We introduce CG+, an algorithm at the intersection of ClippedGossip and NNA, two popular approaches for robust decentralized learning. CG+ meets our upper bound, and thus obtains optimal robustness guarantees, whereas neither of the two existing approaches does. We provide experimental evidence for this gap by presenting an attack tailored to sparse graphs which breaks NNA but against which CG+ is robust.

932Context-Aware Online Recommendation with Bayesian Incentive Compatibility

[openreview] [pdf]

Abstract Recommender systems play a crucial role in internet economies by connecting users with relevant products or services. However, designing effective recommender systems faces two key challenges: (1) the exploration-exploitation tradeoff in balancing new product exploration against exploiting known preferences, and (2) context-aware Bayesian incentive compatibility in accounting for users’ heterogeneous preferences and self-interested behaviors. This paper formalizes these challenges into a Context-aware Bayesian Incentive-Compatible Recommendation Problem (CBICRP). To address the CBICRP, we propose a two-stage algorithm (RCB) that integrates incentivized exploration with an efficient offline learning component for exploitation. In the first stage, our algorithm explores available products while maintaining context-aware Bayesian incentive compatibility to determine sufficient sample sizes. The second stage employs inverse proportional gap sampling integrated with an arbitrary efficient machine learning method to ensure sublinear regret. Theoretically, we prove that RCB achieves O(√(KdT)) regret and satisfies Bayesian incentive compatibility (BIC). Empirically, we validate RCB’s strong incentive gain, sublinear regret, and robustness through simulations and a real-world application on personalized warfarin dosing. Our work provides a principled approach for incentive-aware recommendation in online preference learning settings.

933Online learning meets Adam: The Road of Interpretable Adaptive Optimizer Design

[openreview] [pdf]

Abstract This paper explores the theoretical foundations of Adam, a widely used adaptive optimizer. Building on recent developments in non-convex optimization and online learning, particularly the discounted-to-nonconvex conversion framework, we present two sets of results. First, we introduce clip-free FTRL, a novel variant of the classical Follow-the-Regularized-Leader (FTRL) algorithm. Unlike scale-free FTRL and the recently proposed β-FTRL, our clip-free variant eliminates the need for clipping operations, aligning more closely with Adam’s practical implementation. This modification provides deeper theoretical insights into Adam’s empirical success and aligns the theoretical framework with practical implementations. Second, by incorporating a refined analysis, we establish a Last Iterate Convergence (LIC) guarantee for the proposed discounted-to-nonconvex conversion algorithm, which differs from previous guarantees in which convergence is distributed evenly across all iterations. Additionally, we extend this result to provide a last iterate convergence guarantee for the popular β-FTRL algorithm under the same framework. However, the derived last iterate convergence of β-FTRL reveals a persistent fixed error, potentially suggesting either limitations in popular online learning methods or the need for additional assumptions about the objective function.

934In-Context Reinforcement Learning From Suboptimal Historical Data

[openreview] [pdf]

Abstract Large-scale transformer models have achieved remarkable empirical successes, largely due to their in-context learning capabilities. Inspired by this, we explore training an autoregressive transformer for in-context Reinforcement Learning (RL). In this setting, we initially train a transformer on an offline dataset consisting of trajectories collected from various RL instances, and then fix and use this transformer to create an action policy for new RL instances. Notably, we consider the setting where the offline dataset contains trajectories sampled from suboptimal behavioral policies. In this case, standard autoregressive training corresponds to imitation learning and results in suboptimal performance. To address this, we propose the Decision Importance Transformer (DIT), which emulates the actor-critic algorithm in an in-context manner. In particular, we first train a transformer-based value function that estimates the advantage functions of the behavior policies that collected the suboptimal trajectories. Then we train a transformer-based policy via a weighted maximum likelihood estimation loss, where the weights are constructed based on the trained value function to steer the suboptimal policies to the optimal ones. We conduct extensive experiments to test the performance of DIT on both bandit and Markov Decision Process problems. Our results show that DIT achieves superior performance, particularly when the offline dataset contains suboptimal historical data.
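
The weighted maximum-likelihood step can be sketched in a few lines, assuming exponentiated advantages as weights (an AWR-style choice; the paper's exact weighting scheme may differ):

```python
import torch
import torch.nn.functional as F

def weighted_mle_loss(logits, actions, advantages, temperature=1.0):
    """Advantage-weighted imitation loss: log-likelihood of each action,
    weighted by exp(advantage) so better-than-behavior actions dominate."""
    logp = F.log_softmax(logits, dim=-1)
    logp_a = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    w = torch.exp(advantages / temperature).detach()   # value net is frozen here
    return -(w * logp_a).mean()
```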

935Propagation Alone is Enough for Graph Contrastive Learning

[openreview] [pdf]

Abstract Graph contrastive learning has recently gained substantial attention, leading to the development of various methodologies. In this work, we reveal that a simple training-free propagation method, PROP, achieves competitive results compared to carefully designed GCL methods across a diverse set of benchmarks. We elucidate the underlying rationale for PROP’s effectiveness by drawing connections between the propagation operator and established unsupervised learning algorithms. To investigate the reasons for the suboptimal performance of existing GCL methods, we decouple the propagation and transformation phases of graph neural networks. Our findings indicate that GCL inadequately learns effective transformation weights while exhibiting potential for solid propagation learning. In light of these insights, we enhance PROP with learnable propagation, introducing a novel GCL method termed PROPGCL. The effectiveness of PROPGCL is demonstrated through comprehensive evaluations.
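
A training-free propagation baseline of this kind fits in a few lines; the sketch below assumes the common symmetrically normalized adjacency with self-loops (the paper's PROP operator may differ in details):

```python
import numpy as np
import scipy.sparse as sp

def propagate(adj: sp.spmatrix, X: np.ndarray, k: int = 2) -> np.ndarray:
    """Apply the normalized adjacency k times to node features, with no
    learned parameters; the output can be fed directly to clustering or a
    linear evaluation head."""
    adj = adj + sp.eye(adj.shape[0])                    # add self-loops
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.asarray(adj.sum(1)).ravel()))
    a_hat = d_inv_sqrt @ adj @ d_inv_sqrt               # D^-1/2 (A+I) D^-1/2
    for _ in range(k):
        X = a_hat @ X
    return X
```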

936Flow of Reasoning: Training LLMs for Divergent Problem Solving with Minimal Examples

[openreview] [pdf]

Abstract The ability to generate diverse solutions to a given problem is a hallmark of human creativity. This divergent reasoning is also crucial for machines, enhancing their robustness and enabling them to assist humans in many applications such as scientific discovery. However, existing approaches to multi-step reasoning with large language models (LLMs) have mostly focused only on the reasoning accuracy, without further discovering more diverse valid solutions. For example, supervised fine-tuning can improve LLM reasoning quality, but requires extensive supervised data to capture the full range of possible solutions. Reinforcement learning aims to find limited highest-reward solutions while neglecting the solution diversity. To fill this gap, we propose Flow of Reasoning (FoR), an efficient diversity-seeking LLM finetuning method aimed at improving reasoning quality and diversity with minimal data. FoR formulates multi-step LLM reasoning as a Markovian flow on a DAG-structured reasoning graph. This formulation allows us to incorporate and adapt principled GFlowNet approaches, for finetuning LLMs to sample diverse reasoning paths with probabilities proportional to the (unnormalized) reward of target problems. Extensive experiments show that, with limited training examples (e.g., 15 examples), FoR enables the discovery of diverse, creative, high-quality solutions, greatly outperforming a wide range of existing inference and training methods across five challenging puzzle-solving tasks, including BlocksWorld (embodied reasoning), Game24 (math puzzle solving), PrOntoQA (logical reasoning), Rubik’s Cube (spatial reasoning), and 1D-ARC (abstraction reasoning).

937From Global Assessment to Local Selection: Efficiently Solving Traveling Salesman Problems of All Sizes

[openreview] [pdf]

Abstract The Traveling Salesman Problem (TSP) is a well-known combinatorial optimization problem with broad real-world applications. Recent advancements in neural network-based TSP solvers have shown promising results. Nonetheless, these models often struggle to efficiently solve both small- and large-scale TSPs using the same set of pre-trained model parameters, limiting their practical utility. To address this issue, we introduce a novel neural TSP solver named GELD, built upon our proposed broad global assessment and refined local selection framework. Specifically, GELD integrates a lightweight Global-view Encoder (GE) with a heavyweight Local-view Decoder (LD) to enrich embedding representation while accelerating the decision-making process. Moreover, GE incorporates a novel low-complexity attention mechanism, allowing GELD to achieve low inference latency and scalability to larger-scale TSPs. Additionally, we propose a two-stage training strategy that utilizes training instances of different sizes to bolster GELD’s generalization ability. Extensive experiments conducted on both synthetic and real-world datasets demonstrate that GELD outperforms seven state-of-the-art models considering both solution quality and inference speed. Furthermore, GELD can be employed as a post-processing method to exchange affordable computing time for significantly improved solution quality, capable of solving TSPs with up to 744,710 nodes without relying on divide-and-conquer strategies.

938Experimental Design for Nonstationary Optimization

[openreview] [pdf]

Abstract Traditional methods for optimizing neural networks often struggle when used to train networks in settings where the data distributions change, and plasticity preservation methods have been shown to improve performance in such settings (e.g., continual learning and reinforcement learning). With the growing interest in nonstationary optimization and plasticity research, there is also a growing need to properly define experimental design and hyperparameter search protocols to enable principled research. Each newly proposed work typically adds several new hyperparameters and makes many more design decisions, such as hyperparameter selection protocols, evaluation protocols, and types of tasks examined. While innovation in experiment design is important, it is also necessary to (1) question whether those innovations are leading to the best progress and (2) have standardized practices that make it easier to directly compare to prior works. In this paper, we first perform an extensive empirical study of over 27,000 trials looking at the performance of different methods and hyperparameters across different settings and architectures used in the literature, to provide an evaluation of these methods and the hyperparameters they use under similar experimental conditions. We then examine several core experiment design choices made by the community, affirming some while providing evidence against others, and provide concrete recommendations and analysis that can be used to guide future research.

939Turn-by-Turn Driving Navigation: Leveraging Sequence Model for Real-time Audio Instructions

[openreview] [pdf]

Abstract Turn-by-turn (TBT) navigation systems are integral to modern driving experiences, providing real-time audio instructions to guide drivers safely to destinations. However, existing audio instruction policies often rely on rule-based approaches that struggle to balance informational content with cognitive load, potentially leading to driver confusion or missed turns in complex environments. To overcome these difficulties, we first model the generation of audio instructions as a multi-task learning problem by decomposing the audio content into combinations of modular elements. Then, we propose a novel deep learning framework that leverages the powerful spatiotemporal information processing capabilities of Transformers and the strong multi-task learning abilities of Mixture of Experts (MoE) to generate real-time, context-aware audio instructions for TBT driving navigation. A cloud-edge collaborative architecture is implemented to handle the computational demands of the model, ensuring scalability and real-time performance for practical applications. Real-world experimental results demonstrate that the proposed method significantly reduces the yaw rate compared to traditional methods, delivering clearer and more effective audio instructions. This is the first large-scale application of deep learning in driving audio navigation, marking a substantial advancement in intelligent transportation and driving assistance technologies.

940Learning Neural Networks with Distribution Shift: Efficiently Certifiable Guarantees

[openreview] [pdf]

Abstract We give the first provably efficient algorithms for learning neural networks with respect to distribution shift. We work in the Testable Learning with Distribution Shift framework (TDS learning) of Klivans et al. (2024), where the learner receives labeled examples from a training distribution and unlabeled examples from a test distribution and must either output a hypothesis with low test error or reject if distribution shift is detected. No assumptions are made on the test distribution. All prior work in TDS learning focuses on classification, while here we must handle the setting of nonconvex regression. Our results apply to real-valued networks with arbitrary Lipschitz activations and work whenever the training distribution has strictly sub-exponential tails. For training distributions that are bounded and hypercontractive, we give a fully polynomial-time algorithm for TDS learning one-hidden-layer networks with sigmoid activations. We achieve this by importing classical kernel methods into the TDS framework using data-dependent feature maps and a type of kernel matrix that couples samples from both train and test distributions.

941Convergence Analysis of the Wasserstein Proximal Algorithm beyond Convexity

[openreview] [pdf]

Abstract The proximal algorithm is a powerful tool to minimize nonlinear and nonsmooth functionals in a general metric space. Motivated by the recent progress in studying the training dynamics of the noisy gradient descent algorithm on two-layer neural networks in the mean-field regime, we provide in this paper a simple and self-contained analysis for the convergence of the general-purpose Wasserstein proximal algorithm without assuming geodesic convexity on the objective functional. Under a natural Wasserstein analog of the Euclidean Polyak-Łojasiewicz inequality, we show that the proximal algorithm achieves an unbiased and linear convergence rate. Our convergence rate improves upon existing rates of the proximal algorithm for solving Wasserstein gradient flows under strong geodesic convexity. We also extend our analysis to the inexact proximal algorithm for geodesically semiconvex objectives. In our numerical experiments, proximal training demonstrates a faster convergence rate than the noisy gradient descent algorithm on mean-field neural networks.
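
For orientation, the Wasserstein proximal (JKO) step and a Polyak-Łojasiewicz-type condition of the kind referenced above take the following standard forms (notation assumed here, not taken from the paper):

```latex
% Proximal (JKO) step with step size \tau on a functional F over P_2:
\rho_{k+1} \in \operatorname*{arg\,min}_{\rho \in \mathcal{P}_2}
  \; F(\rho) + \frac{1}{2\tau} W_2^2(\rho, \rho_k)

% Wasserstein analog of the Polyak-Lojasiewicz inequality:
\frac{1}{2} \left\| \nabla_{W_2} F(\rho) \right\|_{L^2(\rho)}^2
  \;\ge\; \mu \left( F(\rho) - \inf_{\rho'} F(\rho') \right)
```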

942Towards Black-Box Membership Inference Attack for Diffusion Models

[openreview] [pdf]

Abstract Given the rising popularity of AI-generated art and the associated copyright concerns, identifying whether an artwork was used to train a diffusion model is an important research topic. This work approaches the problem from the membership inference attack (MIA) perspective. We first identify the limitation of applying existing MIA methods to proprietary diffusion models: the required access to internal U-nets. To address this problem, we introduce a novel membership inference attack method that uses only the image-to-image variation API and operates without access to the model’s internal U-net. We validate our method using DDIM and Stable Diffusion setups and further extend both our approach and existing algorithms to the Diffusion Transformer architecture. Our experimental results consistently outperform previous methods.

943REVEAL-IT: REinforcement learning with Visibility of Evolving Agent poLicy for InTerpretability

[openreview] [pdf]

Abstract Understanding the agent’s learning process, particularly the factors that contribute to its success or failure post-training, is crucial for comprehending the rationale behind the agent’s decision-making process. Prior methods clarify the learning process by creating a structural causal model (SCM) or visually representing the distribution of value functions. Nevertheless, these approaches have constraints as they exclusively function in 2D environments or with uncomplicated transition dynamics. Understanding the agent’s learning process in complicated environments or tasks is more challenging. In this paper, we propose REVEAL-IT, a novel framework for explaining the learning process of an agent in complex environments. Initially, we visualize the policy structure and the agent’s learning process for various training tasks. By visualizing these findings, we can understand how much a particular training task or stage affects the agent’s performance in the test. Then, a GNN-based explainer learns to highlight the most important sections of the policy, providing a clearer and more robust explanation of the agent’s learning process. The experiments demonstrate that explanations derived from this framework can effectively help optimize the training tasks, resulting in improved learning efficiency and final performance.

944Bridging Lottery Ticket and Grokking: Understanding Grokking from Inner Structure of Networks

[openreview] [pdf]

Abstract Grokking is the intriguing phenomenon of delayed generalization: networks initially memorize training data with perfect accuracy but poor generalization, then transition to a generalizing solution with continued training. While reasons for this delayed generalization, such as weight norms and sparsity, have been discussed, the influence of network structure, particularly the role of subnetworks, still needs to be explored. In this work, we link the grokking phenomenon to the lottery ticket hypothesis to investigate the impact of inner network structures. We demonstrate that using lottery tickets obtained at the generalizing phase (‘grokking tickets’) significantly reduces delayed generalization on various tasks, including multiple modular arithmetic, polynomial regression, sparse parity, and MNIST. For example, grokking tickets accelerate grokking (the transition from memorization to generalization), requiring as little as 1/65 of the training needed by dense networks in modular addition. Through a series of controlled experiments, our findings reveal that neither small weight norms nor sparsity alone account for the reduction of delayed generalization; instead, the presence of a good subnetwork structure is crucial. Analyzing the transition from memorization to generalization, we observe that rapid changes in subnetwork structures, measured by the Jaccard distance, strongly correlate with improvements in test accuracy. We further show that pruning techniques can accelerate the grokking process, transforming a memorizing network into a generalizing one without updating the weights. By demonstrating that good subnetworks are key to achieving generalization and that pruning can expedite this process, we provide new insights into the mechanisms underlying neural network generalization.

945LLM Cascade with Multi-Objective Optimal Consideration

[openreview] [pdf]

Abstract Large Language Models (LLMs) have demonstrated exceptional capabilities in understanding and generating natural language. However, their high deployment costs often pose a barrier to practical applications. Cascading local and server models offers a promising solution to this challenge. While existing studies on LLM cascades have primarily focused on the performance-cost trade-off, real-world scenarios often involve more complex requirements. This paper introduces a novel LLM cascade strategy with multi-objective optimization, enabling LLM cascades to consider additional objectives (e.g., privacy) and better align with the specific demands of real-world applications while maintaining their original cascading abilities. Extensive experiments on three benchmarks validate the effectiveness and superiority of our approach.

946Universal Concavity-Aware Descent Rate for Optimizers

[openreview] [pdf]

Abstract Many machine learning problems involve the challenging task of calibrating parameters in a computational model to fit the training data; this task is especially challenging for non-convex problems. Many optimization algorithms have been proposed to assist in calibrating these parameters, each with its respective advantages in different scenarios, but it is often difficult to determine the scenarios for which an algorithm is best suited. To contend with this challenge, much work has been done on proving the rate at which these optimizers converge to their final solution; however, the wide variety of such convergence rate bounds, each with its own assumptions, convergence metrics, tightness, and parameters (which may or may not be known to the practitioner), makes comparing these convergence rates difficult. To help with this problem, we present a minmax-optimal algorithm and, by comparison to it, give a single descent bound which is applicable to a very wide family of optimizers, tasks, and data (including all of the most prevalent ones), and which puts special emphasis on being tight even in parameter subspaces in which the cost function is concave.

947Almost Sure Convergence of Average Reward Temporal Difference Learning

[openreview] [pdf]

Abstract Tabular average reward Temporal Difference (TD) learning is perhaps the simplest and the most fundamental policy evaluation algorithm in average reward reinforcement learning. After at least 25 years since its discovery, we are finally able to provide a long-awaited almost sure convergence analysis. Namely, we are the first to prove that, under very mild conditions, tabular average reward TD converges almost surely to a sample path dependent fixed point. Key to this success is a new general stochastic approximation result concerning nonexpansive mappings with Markovian and additive noise, built on recent advances in stochastic Krasnoselskii-Mann iterations.

948Differentiable Integer Linear Programming

[openreview] [pdf]

Abstract Machine learning (ML) techniques have shown great potential in generating high-quality solutions for integer linear programs (ILPs). However, existing methods typically rely on a supervised learning paradigm, leading to (1) expensive training cost due to repeated invocations of traditional solvers to generate training labels, and (2) plausible yet infeasible solutions due to the misalignment between the training objective (minimizing prediction loss) and the inference objective (generating high-quality solutions). To tackle this challenge, we propose DiffILO (Differentiable Integer Linear Programming Optimization), an unsupervised learning paradigm for learning to solve ILPs. Specifically, through a novel probabilistic modeling, DiffILO reformulates ILPs, which are discrete and constrained optimization problems, into continuous, differentiable (almost everywhere), and unconstrained ones. This reformulation enables DiffILO to simultaneously solve ILPs and train the model via straightforward gradient descent, providing two major advantages. First, it significantly reduces the training cost, as the training process does not need the aid of traditional solvers at all. Second, it facilitates the generation of feasible and high-quality solutions, as the model learns to solve ILPs in an end-to-end manner, thus aligning the training and inference objectives. Experiments on commonly used ILP datasets demonstrate that DiffILO not only achieves an average training speedup of 13.2 times compared to supervised methods, but also outperforms them by generating heuristic solutions with significantly higher feasibility ratios and much better solution quality.
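
A toy version of this kind of reformulation: relax each binary variable to a probability and penalize constraint violation, yielding an unconstrained objective trainable by gradient descent. The penalty form below is an illustrative assumption, not the paper's exact probabilistic model.

```python
import torch

def diffilo_style_loss(theta, c, A, b, lam=10.0):
    """Differentiable surrogate for: min c.x  s.t.  Ax <= b, x in {0,1}^n.
    theta are free logits; p = sigmoid(theta) relaxes x to probabilities."""
    p = torch.sigmoid(theta)
    objective = c @ p                              # relaxed cost
    violation = torch.relu(A @ p - b).sum()        # soft constraint penalty
    return objective + lam * violation

# Gradient-descent "solving" loop on a tiny made-up instance;
# rounding p at the end gives a candidate integer solution.
theta = torch.zeros(4, requires_grad=True)
c = torch.tensor([1.0, 2.0, -3.0, 0.5])
A = torch.tensor([[1.0, 1.0, 1.0, 0.0]])
b = torch.tensor([2.0])
opt = torch.optim.Adam([theta], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = diffilo_style_loss(theta, c, A, b)
    loss.backward()
    opt.step()
x = (torch.sigmoid(theta) > 0.5).float()
```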

949RouteFinder: Towards Foundation Models for Vehicle Routing Problems

[openreview] [pdf]

Abstract This paper introduces RouteFinder, a comprehensive foundation model framework to tackle different Vehicle Routing Problem (VRP) variants. Our core idea is that a foundation model for VRPs should be able to represent variants by treating each as a subset of a generalized problem equipped with different attributes. We propose a unified VRP environment capable of efficiently handling any attribute combination. The RouteFinder model leverages a modern transformer-based encoder and global attribute embeddings to improve task representation. Additionally, we introduce two reinforcement learning techniques to enhance multi-task performance: mixed batch training, which enables training on different variants at once, and multi-variant reward normalization to balance different reward scales. Finally, we propose efficient adapter layers that enable fine-tuning for new variants with unseen attributes. Extensive experiments on 24 VRP variants show RouteFinder achieves competitive results. Our code is openly available.

950A Mathematics-Inspired Learning-to-Optimize Framework for Decentralized Optimization

[openreview] [pdf]

Abstract Most decentralized optimization algorithms are handcrafted. While endowed with strong theoretical guarantees, these algorithms generally target a broad class of problems, thereby not being adaptive or customized to specific problem features. This paper studies data-driven decentralized algorithms trained to exploit problem features to boost convergence. Existing learning-to-optimize methods typically suffer from poor generalization or prohibitively vast search spaces. In addition, they face more challenges in decentralized settings where nodes must reach consensus through neighborhood communications without global information. To resolve these challenges, this paper first derives the necessary conditions that successful decentralized algorithmic rules need to satisfy to achieve both optimality and consensus. Based on these conditions, we propose a novel Mathematics-inspired Learning-to-optimize framework for Decentralized optimization (MiLoDo). Empirical results demonstrate that MiLoDo-trained algorithms outperform handcrafted algorithms and exhibit strong generalization. Algorithms learned via MiLoDo in 100 iterations perform robustly when run for 100,000 iterations during inference. Moreover, MiLoDo-trained algorithms on synthetic datasets perform well on problems involving real data, higher dimensions, and different loss functions.

951Graph Supervised Contrastive Learning for Geodemographics

[openreview] [pdf]

Abstract Geodemographic analysis is essential for understanding population characteristics and addressing socio-economic disparities across regions. However, limited research has been conducted on modelling changes in demographic data over time using Graph Neural Networks (GNNs). In this study, we address this gap by leveraging GNNs to model correlations between the 2011 census data (England & Wales), observing changes over time, and the Output Area Classification 2021, which reflects socio-economic differences between Output Areas (OAs). We propose a novel framework that utilises Supervised Contrastive Learning on graphs to obtain robust OA embeddings, with a particular focus on improving the model’s performance for minority classes. To evaluate the effectiveness of our framework, we conducted two downstream tasks based on the 2021 OA embeddings. Our results demonstrate that the proposed approach provides valuable insights for geodemographic analysis and offers policymakers a useful tool for assessing socio-economic transitions over time and planning ahead on that basis.

952Effective LLM Knowledge Learning Requires Rethinking Generalization

[openreview] [pdf]

Abstract Large language models (LLMs) are trained on a substantial amount of documents that contain extensive world knowledge. However, it is still not well understood how knowledge is acquired via autoregressive pre-training and extracted via question-answering. This lack of understanding greatly hinders effective knowledge learning, especially for continued pre-training on up-to-date information, as this evolving information often lacks the diverse repetitions of foundational knowledge. In this paper, we focus on understanding and improving LLM knowledge learning. We find and verify that knowledge learning for LLMs can be viewed as an implicit supervised task hidden in the autoregressive pre-training objective. Our findings suggest that knowledge learning for LLMs would benefit from methods designed to improve generalization ability for supervised tasks. Based on our analysis, we propose diversifying the formats of training documents as data augmentation to grow in-distribution samples. Unlike text paraphrasing, this data augmentation method does not risk altering the facts embedded in the documents. We also introduce sharpness-aware minimization as an effective optimization algorithm to better improve generalization. Moreover, we adapt our method to instruction tuning for generalization to various phrasings of questions. Extensive experimental results validate our findings and demonstrate our methods’ effectiveness in improving knowledge learning in both the continued pre-training and instruction tuning stages. This paper offers new perspectives and insights to interpret and design effective strategies for LLM knowledge learning.
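
Sharpness-aware minimization, the optimizer adopted above, perturbs the weights toward the local worst case before computing the descent gradient. A minimal two-pass sketch of standard SAM, simplified to a plain SGD base step (the paper's training setup will differ):

```python
import torch

def sam_step(params, closure, rho=0.05, lr=0.01):
    """One SAM step: ascend by rho along the normalized gradient, take the
    gradient there, undo the perturbation, then descend. `closure` recomputes
    the training loss on the current parameters."""
    loss = closure()
    grads = torch.autograd.grad(loss, params)
    scale = rho / (torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12)
    eps = [g * scale for g in grads]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)                     # move to the nearby sharp point
    loss_adv = closure()
    grads_adv = torch.autograd.grad(loss_adv, params)
    with torch.no_grad():
        for p, e, g in zip(params, eps, grads_adv):
            p.sub_(e)                     # restore original weights
            p.sub_(g, alpha=lr)           # SGD step with the SAM gradient
```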

[openreview] [pdf]

Abstract Research on Out-Of-Distribution (OOD) detection focuses mainly on building scores that efficiently distinguish OOD data from In-Distribution (ID) data. On the other hand, Conformal Prediction (CP) uses non-conformity scores to construct prediction sets with probabilistic coverage guarantees. In other words, the former designs scores, while the latter designs probabilistic guarantees based on scores. Therefore, we claim that these two fields might be naturally intertwined. This work advocates for cross-fertilization between OOD and CP by formalizing their link and emphasizing two benefits of using them jointly. First, we show that in standard OOD benchmark settings, evaluation metrics can be overly optimistic due to the test dataset’s finite sample size. Based on the work of Bates et al. (2022), we define new conformal AUROC and conformal FPR@TPRβ metrics, which are corrections that provide probabilistic conservativeness guarantees on the variability of these metrics. We show the effect of these corrections on two reference OOD and anomaly detection benchmarks, OpenOOD (Yang et al., 2022) and ADBench (Han et al., 2022). Second, we explore using OOD scores as non-conformity scores and show that they can improve the efficiency of the prediction sets obtained with CP.
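
The coupling is easy to state in code: any OOD score can serve as the non-conformity score in split conformal prediction. A minimal sketch of the standard split-CP recipe (variable names here are hypothetical):

```python
import numpy as np

def conformal_threshold(cal_scores: np.ndarray, alpha: float = 0.1) -> float:
    """Split-conformal threshold: the ceil((n+1)(1-alpha))/n empirical
    quantile of calibration non-conformity scores."""
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(cal_scores, level, method="higher"))

def prediction_set(nonconformity, x, labels, tau):
    """Keep every label whose score does not exceed the threshold. Plugging
    an OOD score in as `nonconformity` is the pairing the abstract proposes."""
    return [y for y in labels if nonconformity(x, y) <= tau]
```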

954Last-Iterate Convergence Properties of Regret-Matching Algorithms in Games

[openreview] [pdf]

Abstract We study last-iterate convergence properties of algorithms for solving two-player zero-sum games based on Regret Matching+ (RM+). Despite their widespread use for solving real games, virtually nothing is known about their last-iterate convergence. A major obstacle to analyzing RM-type dynamics is that their regret operators lack Lipschitzness and (pseudo)monotonicity. We start by showing numerically that several variants used in practice, such as RM+, predictive RM+ and alternating RM+, all lack last-iterate convergence guarantees even on a simple 3×3 matrix game. We then prove that recent variants of these algorithms based on a smoothing technique, extragradient RM+ and smooth predictive RM+, enjoy asymptotic last-iterate convergence (without a rate), 1/√t best-iterate convergence, and, when combined with restarting, linear-rate last-iterate convergence. Our analysis builds on a new characterization of the geometric structure of the limit points of our algorithms, marking a significant departure from most of the literature on last-iterate convergence. We believe that our analysis may be of independent interest and offers a fresh perspective for studying last-iterate convergence in algorithms based on non-monotone operators.
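
For concreteness, the basic RM+ operator that all the studied variants build on, in its standard form (the smoothed and predictive variants modify how the utility vector is chosen):

```python
import numpy as np

def rm_plus_step(Q, x, u):
    """One Regret Matching+ update for a single player.
    Q: thresholded cumulative regrets; x: current strategy; u: per-action
    utilities observed this round. Returns the updated (Q, x)."""
    inst_regret = u - np.dot(u, x)           # regret of each pure action
    Q = np.maximum(Q + inst_regret, 0.0)     # RM+ thresholds regrets at zero
    x = Q / Q.sum() if Q.sum() > 0 else np.full_like(x, 1.0 / x.size)
    return Q, x
```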

955Questioning Simplicity Bias Assumptions

[openreview] [pdf]

Abstract The Simplicity Bias (SB) is the observation that training most commonly used neural network architectures with standard techniques is biased toward learning simple functions. This phenomenon can be a benefit or a drawback depending on the relative complexity of the desired function to be learnt. If the desired function is relatively simple, the bias is helpful. However, if there are simpler features that are highly predictive, commonly called shortcuts or spurious features, that are not present in the test environment, the SB can result in poor generalisation performance. Most existing works on mitigating the SB make various assumptions, either about the features present in the train and test domains or by assuming access to information about the test domain at train time. In this paper we review recent work on the SB and take a critical look at these assumptions.

956Monophilic Neighbourhood Transformers

[openreview] [pdf]

Abstract Graph neural networks (GNNs) have seen widespread application across diverse fields, including social network analysis, chemical research, and computer vision. Nevertheless, their efficacy is compromised by an inherent reliance on the homophily assumption, which posits that adjacent nodes should exhibit relevance or similarity. This assumption becomes a limitation when dealing with heterophilic graphs, where it is more common for dissimilar nodes to be connected. Addressing this challenge, recent research indicates that real-world graphs generally exhibit monophily, a characteristic where a node tends to be related to the neighbours of its neighbours. Inspired by this insight, we introduce Neighbourhood Transformers (NT), a novel approach that employs self-attention within every neighbourhood of the graph to generate informative messages for the nodes within, as opposed to the central node in conventional GNN frameworks. We develop a neighbourhood partitioning strategy equipped with switchable attentions, significantly reducing space consumption by over 95% and time consumption by up to 92.67% in NT. Experimental results on node classification tasks across 5 heterophilic and 5 homophilic graphs demonstrate that NT outperforms current state-of-the-art methods, showcasing its expressiveness and adaptability to different graph types. The code for this study will be made available following the publication of this manuscript.

957Dynamic Elimination For PAC Optimal Item Selection From Relative Feedback

[openreview] [pdf]

Abstract We study the problem of best-item identification from relative feedback where a learner adaptively plays subsets of items and receives stochastic feedback in the form of the best item in the set. We propose an algorithm - Dynamic Elimination (DE) - that dynamically prunes sub-optimal items from contention to efficiently identify the best item and show a strong sample complexity upper bound for it. We further formalize the notion of inferred updates to obtain estimates on item win rates without directly playing them by leveraging item correlation information. We propose the Dynamic Elimination by Correlation (DEBC) algorithm as an extension to DE with inferred updates. We show through extensive experiments that DE and DEBC significantly outperform all existing baselines across multiple datasets in various settings.

958TOP-ERL: Transformer-based Off-Policy Episodic Reinforcement Learning

[openreview] [pdf]

Abstract This work introduces Transformer-based Off-Policy Episodic Reinforcement Learning (TOP-ERL), a novel algorithm that enables off-policy updates in the ERL framework. In ERL, policies predict entire action trajectories over multiple time steps instead of single actions at every time step. These trajectories are typically parameterized by trajectory generators such as Movement Primitives (MP), allowing for smooth and efficient exploration over long horizons while capturing high-level temporal correlations. However, ERL methods are often constrained to on-policy frameworks due to the difficulty of evaluating state-action values for entire action sequences, limiting their sample efficiency and preventing the use of more efficient off-policy architectures. TOP-ERL addresses this shortcoming by segmenting long action sequences and estimating the state-action values for each segment using a transformer-based critic architecture alongside an n-step return estimation. These contributions result in efficient and stable training that is reflected in the empirical results conducted on sophisticated robot learning environments. TOP-ERL significantly outperforms state-of-the-art RL methods. Thorough ablation studies additionally show the impact of key design choices on the model performance.

959How Far Are We from True Unlearnability?

[openreview] [pdf]

Abstract High-quality data plays an indispensable role in the era of large models, but the use of unauthorized data for model training greatly damages the interests of data owners. To overcome this threat, several unlearnable methods have been proposed, which generate unlearnable examples (UEs) by compromising the training availability of data. Clearly, due to unknown training purposes and the powerful representation learning capabilities of existing models, these data are expected to be unlearnable for various task models, i.e., they will not help improve the model’s performance. However, unexpectedly, we find that on the multi-task dataset Taskonomy, UEs still perform well in tasks such as semantic segmentation, failing to exhibit cross-task unlearnability. This phenomenon leads us to question: how far are we from attaining truly unlearnable examples? We attempt to answer this question from the perspective of model optimization. We observe the difference in the convergence process between clean and poisoned models on a simple model using the loss landscape, and find that only a part of the critical parameter optimization paths show significant differences, implying a close relationship between the loss landscape and unlearnability. Consequently, we employ the loss landscape to explain the underlying reasons for UEs and propose Sharpness-Aware Learnability (SAL) for quantifying the unlearnability of parameters based on this explanation. Furthermore, we propose an Unlearnable Distance (UD) metric to measure the unlearnability of data based on the SAL distribution of parameters in clean and poisoned models. Finally, we conduct benchmark tests on mainstream unlearnable methods using the proposed UD, aiming to promote community awareness of the capability boundaries of existing unlearnable methods. The code is available at https://github.com/MLsecurityLab/HowFarAreFromTrueUnlearnability.git.

960Characterizing the Training Dynamics of Private Fine-tuning with Langevin Diffusion

[openreview] [pdf]

Abstract We show that differentially private full fine-tuning (DP-FFT) can distort pre-trained backbone features based on both theoretical and empirical results. We identify the cause of the distortion as the misalignment between the pre-trained backbone and the randomly initialized linear head. We prove that a sequential fine-tuning strategy can mitigate the feature distortion: first-linear-probing-then-fine-tuning (DP-LP-FFT). A new approximation scheme allows us to derive approximate upper and lower bounds on the training loss of DP-LP and DP-FFT, in a simple but canonical setting of 2-layer neural networks with ReLU activation. Experiments on real-world datasets and architectures are consistent with our theoretical insights. We also derive new upper bounds for 2-layer linear networks without the approximation. Moreover, our theory suggests a trade-off of privacy budget allocation in multi-phase fine-tuning methods like DP-LP-FFT.

961On the Benefits of Attribute-Driven Graph Domain Adaptation

[openreview] [pdf]

Abstract Graph Domain Adaptation (GDA) addresses a pressing challenge in cross-network learning, particularly pertinent due to the absence of labeled data in real-world graph datasets. Recent studies attempted to learn domain-invariant representations by eliminating structural shifts between graphs. In this work, we show that existing methodologies have overlooked the significance of graph node attributes, a pivotal factor for graph domain alignment. Specifically, we first reveal the impact of node attributes for GDA by theoretically proving that, in addition to the graph structural divergence between the domains, the node attribute discrepancy also plays a critical role in GDA. Moreover, we also empirically show that the attribute shift is more substantial than the topology shift, which further underscores the importance of node attribute alignment in GDA. Inspired by this finding, a novel cross-channel module is developed to fuse and align both views between the source and target graphs for GDA. Experimental results on a variety of benchmarks verify the effectiveness of our method.

962Enhancing Multi-Objective Offline RL with Adaptive Preference Integration

[openreview] [pdf]

Abstract Multi-objective reinforcement learning (MORL) is crucial for real-world applications where multiple conflicting goals must be optimized, such as in healthcare or autonomous systems. Offline MORL extends these benefits by using pre-collected datasets, allowing for effective learning without continuous interaction with the environment. However, existing offline MORL algorithms often struggle with scaling across large preference spaces and handling unknown preferences during evaluation. To address these challenges, we propose the Preference-Attended Multi-Objective Decision Transformer (PA-MODT), a novel architecture that integrates a preference-attention block with a modular transformer structure. This design enables effective generalization over different preferences and trajectories, providing a more robust approach to generating optimal Pareto fronts. We tested PA-MODT on five D4MORL datasets with millions of trajectories representing various objectives and found that it consistently outperforms existing models, achieving Pareto fronts that align closely with the behavioral policy. This demonstrates PA-MODT’s potential to effectively manage complex multi-objective reinforcement learning tasks.

963Optimal Algorithm for Max-Min Fair Bandit

[openreview] [pdf]

Abstract We consider a multi-player multi-armed bandit problem (MP-MAB) where N players compete for K arms in T rounds. The reward distribution is heterogeneous: each player has a different expected reward for the same arm. When multiple players select the same arm, they collide and obtain zero reward. In this paper, we aim to find the max-min fairness matching that maximizes the reward of the player who receives the lowest reward. This paper improves on the existing regret upper bound of O(log T · log log T) for achieving max-min fairness. More specifically, our decentralized fair elimination algorithm (DFE) deals with heterogeneity and collisions carefully and attains a regret upper bound of O((N² + K) log T/Δ), where Δ is the minimum reward gap between the max-min value and sub-optimal arms. We assume N ≤ K to guarantee all players can select their arms without collisions. In addition, we also provide an Ω(max{N², K} log T/Δ) regret lower bound for this problem. This lower bound indicates that our algorithm is optimal with respect to key parameters, which significantly improves on the performance of algorithms in previous work. Numerical experiments again verify the efficiency and improvement of our algorithms.
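
The offline objective the algorithm targets can be checked by brute force on small instances. A sketch, assuming the true reward matrix mu is known (which the bandit setting of course withholds):

```python
import itertools
import numpy as np

def max_min_matching(mu: np.ndarray):
    """Enumerate collision-free assignments of N players to K arms (N <= K)
    and return the one maximizing the minimum player reward."""
    N, K = mu.shape
    best_val, best = -np.inf, None
    for arms in itertools.permutations(range(K), N):
        val = min(mu[p, a] for p, a in enumerate(arms))
        if val > best_val:
            best_val, best = val, arms
    return best_val, best
```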

964Can Transformers In-Context Learn Behavior of a Linear Dynamical System?

[openreview] [pdf]

Abstract We investigate whether transformers can learn to track a random process when given observations of a related process and parameters of the dynamical system that relates them as context. More specifically, we consider a finite-dimensional state-space model described by the state transition matrix F, measurement matrices h_1, …, h_N, and the process and measurement noise covariance matrices Q and R, respectively; these parameters, randomly sampled, are provided to the transformer along with the observations y_1, …, y_N generated by the corresponding linear dynamical system. We argue that in such settings transformers learn to approximate the celebrated Kalman filter, and empirically verify this both for the task of estimating the hidden states x̂_{N|1,2,3,…,N} as well as for one-step prediction of the (N+1)-st observation, ŷ_{N+1|1,2,3,…,N}. A further study of the transformer’s robustness reveals that its performance is retained even if the model’s parameters are partially withheld. In particular, we demonstrate that the transformer remains accurate at the considered task even in the absence of state transition and noise covariance matrices, effectively emulating the operations of the Dual-Kalman filter.
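
The reference solution the transformer is argued to approximate is the textbook Kalman filter. One predict/update step in the abstract's notation, with H stacking the measurement vectors:

```python
import numpy as np

def kalman_step(x_hat, P, y, F, H, Q, R):
    """Standard Kalman predict/update. x_hat, P: previous state estimate and
    covariance; y: new observation; F, H, Q, R as in the state-space model."""
    # Predict
    x_pred = F @ x_hat
    P_pred = F @ P @ F.T + Q
    # Update
    S = H @ P_pred @ H.T + R                  # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x_pred + K @ (y - H @ x_pred)
    P_new = (np.eye(len(x_hat)) - K @ H) @ P_pred
    return x_new, P_new
```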

965Dense Backpropagation Improves Routing for Sparsely-Gated Mixture-of-Experts

[openreview] [pdf]

Abstract Mixture of Experts (MoE) pretraining is more scalable than dense Transformer pretraining, because MoEs learn to route inputs to a sparse set of their feedforward parameters. However, this means that MoEs only receive a sparse backward update, leading to problems such as router load imbalance where some experts receive more tokens than others. We present a lightweight approximation method that gives the MoE a dense gradient while only sparsely activating its parameters. A key insight into the design of our method is that at scale, many tokens not routed to a given expert may nonetheless lie in the span of tokens that were routed to that expert, allowing us to create an approximation for the expert output of that token from existing expert outputs. Our dense backpropagation outperforms standard TopK routing across multiple MoE configurations without increasing runtime.
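
The span argument suggests a simple estimator: reconstruct an unrouted token from the tokens an expert did process, then reuse the same coefficients on that expert's outputs. A least-squares sketch of this idea, assuming the expert acts approximately linearly on that span (the paper's estimator may differ):

```python
import torch

def approx_expert_output(x, X_routed, Y_routed):
    """x: (d,) unrouted token; X_routed: (n, d) tokens routed to the expert;
    Y_routed: (n, d_out) the expert's outputs on them. Solve X_routed^T c = x
    in least squares and return the matching combination of expert outputs."""
    c = torch.linalg.lstsq(X_routed.T, x.unsqueeze(-1)).solution  # (n, 1)
    return (Y_routed.T @ c).squeeze(-1)                           # (d_out,)
```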

966Learning to Plan with Personalized Preferences

[openreview] [pdf]

Abstract Understanding and adapting to human preferences is essential for the effective integration of artificial agents into daily human life, particularly as AI becomes increasingly involved in collaboration and assistance roles. Previous studies on preference recognition in embodied intelligence have largely adopted a generalized yet non-personalized approach. To fill in this gap, our research focuses on empowering embodied agents to learn and adapt to individual preferences, a task complicated by the challenges of inferring these preferences from minimal observations and requiring robust few-shot generalization. To facilitate future study, we introduce PbP, an embodied environment that supports hundreds of diverse preferences ranging from complex action sequences to specific sub-actions. Our experiments on PbP reveal that while symbol-based approaches show promise in terms of effectiveness and scalability, accurately inferring implicit preferences and planning adaptive actions from limited data remain challenging. Nevertheless, preference serves as a valuable abstraction of human behaviors, and incorporating preference as a key intermediary step in planning can significantly enhance the personalization and adaptability of AI agents. We hope our findings can pave the way for future research on more efficient preference learning and personalized planning in dynamic environments.

967Learning Pattern-Specific Experts for Time Series Forecasting Under Patch-level Distribution Shift

[openreview] [pdf]

Abstract Time series forecasting, which aims to predict future values based on historical data, has garnered significant attention due to its broad range of applications. However, real-world time series often exhibit complex non-uniform distribution with varying patterns across segments, such as season, operating condition, or semantic meaning, making accurate forecasting challenging. Existing approaches, which typically train a single model to capture all these diverse patterns, often struggle with the pattern drifts between patches and may lead to poor generalization. To address these challenges, we propose TFPS, a novel architecture that leverages pattern-specific experts for more accurate and adaptable time series forecasting. TFPS employs a dual-domain encoder to capture both time-domain and frequency-domain features, enabling a more comprehensive understanding of temporal dynamics. It then uses subspace clustering to dynamically identify distinct patterns across data patches. Finally, pattern-specific experts model these unique patterns, delivering tailored predictions for each patch. By explicitly learning and adapting to evolving patterns, TFPS achieves significantly improved forecasting accuracy. Extensive experiments on real-world datasets demonstrate that TFPS outperforms state-of-the-art methods, particularly in long-term forecasting, through its dynamic and pattern-aware learning approach. The data and codes are available:https://anonymous.4open.science/r/TFPS-D001.

968Multi-Resolution Decomposable Diffusion Model for Non-Stationary Time Series Anomaly Detection

[openreview] [pdf]

Abstract Recently, generative models have shown considerable promise in unsupervised time series anomaly detection. Nonetheless, the task of effectively capturing complex temporal patterns and minimizing false alarms becomes increasingly challenging when dealing with non-stationary time series, characterized by continuously fluctuating statistical attributes and joint distributions. To confront these challenges, we underscore the benefits of multi-resolution modeling, which improves the ability to distinguish between anomalies and non-stationary behaviors by leveraging correlations across various resolution scales. In response, we introduce aMulti-ResolutionDecomposable DiffusionModel (MODEM), which integrates a coarse-to-fine diffusion paradigm with a frequency-enhanced decomposable network to adeptly navigate the intricacies of non-stationarity. Technically, the coarse-to-fine diffusion model embeds cross-resolution correlations into the forward process to optimize diffusion transitions mathematically. It then innovatively employs low-resolution recovery to guide the reverse trajectories of high-resolution series in a coarse-to-fine manner, enhancing the model’s ability to learn and elucidate underlying temporal patterns. Furthermore, the frequency-enhanced decomposable network operates in the frequency domain to extract globally shared time-invariant information and time-variant temporal dynamics for accurate series reconstruction. Extensive experiments conducted across five real-world datasets demonstrate that our proposed MODEM achieves state-of-the-art performance and can be generalized to other time series tasks. The code will be publicly available upon acceptance.

969Multiple-Frequencies Population-Based Training

[openreview] [pdf]

Abstract Reinforcement Learning’s high sensitivity to hyperparameters is a source of instability and inefficiency, creating significant challenges for practitioners. Hyperparameter Optimization (HPO) algorithms have been developed to address this issue, among them Population-Based Training (PBT) stands out for its ability to generate hyperparameters schedules in a single training run. PBT trains a population of agents, each with its own hyperparameters, frequently ranking them and replacing the worst performers with mutations of the best agents. These intermediate selection steps can cause PBT to focus on short-term improvements, leading it to get stuck in local optima and eventually fall behind vanilla Random Search over longer timescales. This paper studies how this greediness issue is connected to the choice ofevolution frequency, the rate at which the selection is done. We propose Multiple-Frequencies Population-Based Training (MF-PBT), a novel HPO algorithm that addresses greediness by employing sub-populations, each evolving at distinct frequencies. MF-PBT introduces a migration process to transfer information between sub-populations, with an asymmetric design to balance short and long-term optimization. Extensive experiments on the Brax suite demonstrate that MF-PBT improves sample efficiency and long-term performance, even without tuning hyperparameters. Code will be released.

970Offline-to-Online Reinforcement Learning with Classifier-Free Diffusion Generation

[openreview] [pdf]

Abstract Offline-to-online Reinforcement Learning (O2O RL) aims to perform online fine-tuning on an offline pre-trained policy to minimize costly online interactions. Existing methods have used offline data or online data to generate new data for data augmentation, which has led to performance improvement during online fine-tuning. However, they have not fully analyzed and utilized both types of data simultaneously. Offline data helps prevent agents from settling too early on suboptimal policies by providing diverse data, while online data improves training stability and speeds up convergence. In this paper, we propose a data augmentation approach, Classifier-Free Diffusion Generation (CFDG). Considering the differences between offline data and online data, we use conditional diffusion to generate both types of data for augmentation in the online phase, aiming to improve the quality of sample generation. Experimental results show that CFDG outperforms replaying the two data types or using a standard diffusion model to generate new data. Our method is versatile and can be integrated with existing offline-to-online RL algorithms. By applying CFDG to the popular methods IQL, PEX, and APL, we achieve a notable 15% average improvement in empirical performance on D4RL benchmarks such as MuJoCo and AntMaze.
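
Classifier-free generation conditions the diffusion model on the data source (here, offline vs. online) and mixes conditional and unconditional noise predictions at sampling time. The standard combination is shown below as a sketch; CFDG's exact conditioning details are in the paper:

```python
def cfg_noise(eps_cond, eps_uncond, w=1.5):
    """Classifier-free guidance: amplify the conditional direction by w.
    eps_cond / eps_uncond are the model's noise predictions with and without
    the source-type condition; w = 0 recovers unconditional sampling."""
    return (1 + w) * eps_cond - w * eps_uncond
```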

971Auditing Data Controller Compliance with Data Withdrawal

[openreview] [pdf]

Abstract We study auditing total data withdrawal, the case in which a user requests the exclusion of their data from both the training and test data for some machine learning task. This approach is motivated by the need for comprehensive compliance with data privacy regulations and legal frameworks around the world. We conceptualize the task of auditing total data withdrawal as an optimization problem. Compliance verification is conducted under mild assumptions using a dedicated verification algorithm. We then evaluate this formulation over various datasets, architectures, and verification hyperparameters. Our verification algorithm serves as a tool for regulators to ensure auditable compliance and provides enhanced privacy guarantees for users.

972Real-World Benchmarks Make Membership Inference Attacks Fail on Diffusion Models

[openreview] [pdf]

Abstract Membership inference attacks (MIAs) on diffusion models have emerged as potential evidence of unauthorized data usage in training pre-trained diffusion models. These attacks aim to detect the presence of specific images in training datasets of diffusion models. Our study delves into the evaluation of state-of-the-art MIAs on diffusion models and reveals critical flaws and overly optimistic performance estimates in existing MIA evaluation. We introduce CopyMark, a more realistic MIA benchmark that distinguishes itself through the support for pre-trained diffusion models, unbiased datasets, and fair evaluation pipelines. Through extensive experiments, we demonstrate that the effectiveness of current MIA methods significantly degrades under these more practical conditions. Based on our results, we alert that MIA, in its current state, is not a reliable approach for identifying unauthorized data usage in pre-trained diffusion models. To the best of our knowledge, we are the first to discover the performance overestimation of MIAs on diffusion models and present a unified benchmark for more realistic evaluation.

973Online Decision Deferral under Budget Constraints

[openreview] [pdf]

Abstract Machine Learning (ML) models are increasingly used to support or substitute decision making. In applications where skilled experts are a limited resource, it is crucial to reduce their burden and automate decisions when the performance of an ML model is at least of equal quality. However, models are often pre-trained and fixed, while tasks arrive sequentially and their distribution may shift. In that case, the respective performance of the decision makers may change, and the deferral algorithm must remain adaptive. We propose a contextual bandit model of this online decision making problem. Our framework includes budget constraints and different types of partial feedback models. Beyond the theoretical guarantees of our algorithm, we propose efficient extensions that achieve remarkable performance on real-world datasets.

974Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales

[openreview] [pdf]

Abstract Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance. Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) introduce additional challenges. For instance, diverse preferences complicate the alignment process, and prediction errors in a trained reward model can become more severe as the LLM generates unseen outputs. These RL challenges create confusion about whether the probability of an action for a given state should be increased or decreased, similar to the noise in labels for classification tasks. In this work, we enhance the stability of the RL training procedure by adapting reverse cross-entropy (RCE) from supervised learning for noisy data to define a symmetric RL loss. We demonstrate performance improvements across various tasks and scales. We conduct experiments in discrete action tasks (Atari games) and continuous action space tasks (MuJoCo benchmark and Box2D) using Symmetric A2C (SA2C) and Symmetric PPO (SPPO), with and without added noise. Notably, SPPO shows strong performance across different hyperparameters. Furthermore, we validate the benefits of the symmetric RL loss in the RLHF framework using PPO for natural language processing tasks, demonstrating improved performance in tasks such as IMDB positive sentiment and TL;DR summarization.
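
For intuition, the reverse cross-entropy recipe from noisy-label learning combines the usual cross-entropy with a term whose roles of prediction and label are swapped, with log(0) clamped to a finite constant. How the paper weights the two terms inside the A2C/PPO objectives is not stated in the abstract, so the coefficients below are assumptions.

```python
import torch
import torch.nn.functional as F

def symmetric_ce(logits, targets, alpha=1.0, beta=0.1, log_clip=-4.0):
    # Standard cross-entropy term.
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, logits.size(-1)).float()
    # Reverse cross-entropy: swap prediction and label; log(0) on the one-hot
    # "label distribution" is clamped so the term stays finite.
    rce = -(probs * torch.clamp(torch.log(one_hot), min=log_clip)).sum(-1).mean()
    return alpha * ce + beta * rce

logits = torch.randn(8, 4, requires_grad=True)   # e.g., policy logits over actions
actions = torch.randint(0, 4, (8,))              # sampled actions as (noisy) targets
loss = symmetric_ce(logits, actions)
loss.backward()
```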

975Gradient based Causal Discovery with Diffusion Model

[openreview] [pdf]

Abstract Causal discovery from observational data is an important problem in many applied sciences. Incorporating a recently proposed smooth characterization of acyclicity, gradient-based causal discovery approaches search for a Directed Acyclic Graph (DAG) by optimizing various neural models. Although they show inspiring results when certain assumptions are satisfied, their capability to model complex nonlinear causal generative functions remains unsatisfactory. Motivated by recent advances in deep generative models, we propose to use diffusion models for causal discovery and search for the DAG under continuous optimization frameworks. With flexible parameter configurations, the diffusion model can represent a wide range of functions, and the proposed causal discovery approach is able to recover graphs with satisfactory accuracy from observational data generated by either linear or nonlinear causal models. This is evidenced by empirical results on both synthetic and real data.

976Zero-Order Diffusion Guidance for Inverse Problems

[openreview] [pdf]

Abstract We propose zero-order diffusion guidance, a method that allows using a diffusion model to solve inverse problems without access to the gradients of the process we seek to invert. Our method employs a zero-order gradient estimator combined with a novel differentiable dimensionality reduction strategy to approximate true gradients during guidance while keeping the task computationally tractable in thousands of dimensions. We apply our method to model inversion and demonstrate how it can be used to reconstruct high-quality faces in a realistic scenario where the adversary has only black-box access to face embeddings. Across a range of inverse problems, including synthetic experiments and JPEG restoration, we show that access to gradients is not necessary for effective guidance. Our black-box method matches white-box performance, thus expanding the scope of inverse problems that can be solved with diffusion-based approaches.
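
A standard zero-order estimator of the kind described averages forward finite differences along random directions. In the sketch below, a fixed random projection stands in for the paper's learned differentiable dimensionality reduction; that substitution, and the toy objective, are assumptions.

```python
import numpy as np

def zero_order_grad(f, x, n_dirs=32, eps=1e-3, proj=None):
    # Approximate grad f(x) from forward differences along random directions,
    # optionally drawn in a low-dimensional subspace and lifted to full space.
    d = x.size
    grad = np.zeros(d)
    for _ in range(n_dirs):
        u = np.random.randn(proj.shape[1] if proj is not None else d)
        u = proj @ u if proj is not None else u
        u /= np.linalg.norm(u)
        grad += (f(x + eps * u) - f(x)) / eps * u   # directional estimate
    return grad / n_dirs

f = lambda x: -np.sum((x - 1.0) ** 2)               # black-box objective
x = np.zeros(64)
proj = np.random.randn(64, 8) / np.sqrt(8)          # 8-dim search subspace
for _ in range(200):
    x += 0.05 * zero_order_grad(f, x, proj=proj)    # ascend the estimated gradient
```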

977Extending Stability Analysis to Adaptive Optimization Algorithms Using Loss Surface Geometry

[openreview] [pdf]

Abstract Adaptive optimization algorithms, such as Adam (Kingma & Ba, 2015) and RMSProp (Tieleman & Hinton, 2012), have become integral to training deep neural networks, yet their stability properties and impact on generalization remain poorly understood (Wilson et al., 2017). This paper extends linear stability analysis to adaptive optimizers, providing a theoretical framework that explains their behavior in relation to loss surface geometry (Wu et al., 2022; Jastrzębski et al., 2019). We introduce a novel generalized coherence measure that quantifies the interaction between the adaptive preconditioner and the Hessian of the loss function. This measure yields necessary and sufficient conditions for linear stability near stationary points, offering insights into why adaptive methods may converge to sharper minima with poorer generalization. Our analysis leads to practical guidelines for hyperparameter tuning, demonstrating how to improve the generalization performance of adaptive optimizers. Through extensive experiments on benchmark datasets and architectures, including ResNet (He et al., 2016) and Vision Transformers (Dosovitskiy et al., 2020), we validate our theoretical predictions, showing that aligning the adaptive preconditioner with the loss surface geometry through careful parameter selection can narrow the generalization gap between adaptive methods and SGD (Loshchilov & Hutter, 2018).

978Adaptive Algorithm for Non-Stationary Online Convex-Concave Optimization

[openreview] [pdf]

Abstract This paper addresses the problem of Online Convex-Concave Optimization, an extension of Online Convex Optimization to two-player time-varying convex-concave games. Our objective is to minimize the dynamic duality gap (D-DGap), a key performance metric that evaluates the players’ strategies against arbitrary comparator sequences. Existing algorithms struggle to achieve optimal performance, particularly in stationary or predictable environments. We propose a novel, modular algorithm comprising three key components: an Adaptive Module that adjusts to varying levels of non-stationarity, a Multi-Predictor Aggregator that selects the optimal predictor from multiple candidates, and an Integration Module that seamlessly combines the strengths of both. Our algorithm guarantees a minimax optimal D-DGap upper bound, up to a logarithmic factor, while also achieving a prediction error-based D-DGap bound. Empirical results further demonstrate the effectiveness and adaptability of the proposed method.

979Four eyes see more than two: Dataset Distillation with Mixture-of-Experts

[openreview] [pdf]

Abstract The ever-growing size of datasets in deep learning presents a significant challenge in terms of training efficiency and computational cost. Dataset distillation (DD) has emerged as a promising approach to address this challenge by generating compact synthetic datasets that retain the essential information of the original data. However, existing DD methods often suffer from performance degradation when transferring distilled datasets across different network architectures (i.e., the model trained on the distilled dataset differs from the one used to perform the distillation). To overcome this limitation, we propose a novel mixture-of-experts framework for dataset distillation. Our goal is to promote diversity within the distilled dataset by distributing the distillation tasks to multiple expert models. Each expert specializes in distilling a distinct subset of the dataset, encouraging the experts to capture different aspects of the original data distribution. To further enhance diversity, we introduce a distance correlation minimization strategy that encourages the experts to learn distinct representations. Moreover, during the testing stage (where the distilled dataset is used to train a new model), a mixup-based fusion strategy is applied to better leverage the complementary information captured by each expert. Through extensive experiments, we demonstrate that our framework effectively mitigates the issue of cross-architecture performance degradation in dataset distillation, particularly in low-data regimes, leading to more efficient and versatile deep learning models trained on the distilled data.
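
The distance correlation penalty mentioned here has a compact empirical form (the standard biased estimator built from double-centered pairwise distance matrices). A hedged sketch follows; the expert feature tensors are hypothetical stand-ins, and minimizing the returned quantity pushes the two experts' representations toward statistical independence.

```python
import torch

def distance_correlation(x, y):
    # Biased empirical (squared) distance correlation between two batches.
    def centered_dist(z):
        d = torch.cdist(z, z)
        return d - d.mean(0, keepdim=True) - d.mean(1, keepdim=True) + d.mean()
    A, B = centered_dist(x), centered_dist(y)
    dcov2 = (A * B).mean()                      # distance covariance (squared)
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    return dcov2 / (dvar_x * dvar_y).sqrt().clamp_min(1e-12)

feat_a = torch.randn(64, 128, requires_grad=True)   # expert A representations
feat_b = torch.randn(64, 128, requires_grad=True)   # expert B representations
penalty = distance_correlation(feat_a, feat_b)      # add to the distillation loss
penalty.backward()
```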

980Capability Localization: Capabilities Can be Localized rather than Individual Knowledge

[openreview] [pdf]

Abstract Large-scale language models have achieved superior performance in natural language processing tasks; however, it is still unclear how model parameters affect performance improvement. Previous studies assumed that individual knowledge is stored in local parameters, yet the proposed storage forms (dispersed parameters, parameter layers, or parameter chains) are not unified. Through fidelity and reliability evaluation experiments, we found that individual knowledge cannot be localized. Afterwards, we constructed a dataset for decoupling experiments and discovered the potential for localizing data commonalities. To further reveal this phenomenon, this paper proposes a Commonality Neuron Localization (CNL) method, which successfully locates commonality neurons and achieves a neuron overlap rate of 96.42% on the GSM8K dataset. Finally, we demonstrate through cross-data experiments that commonality neurons are a collection of capability neurons that possess the capability to enhance performance.

981Decision Information Meets Large Language Models: The Future of Explainable Operations Research

[openreview] [pdf]

Abstract Operations Research (OR) is vital for decision-making in many industries. While recent OR methods have seen significant improvements in automation and efficiency through integrating Large Language Models (LLMs), they still struggle to produce meaningful explanations. This lack of clarity raises concerns about transparency and trustworthiness in OR applications. To address these challenges, we propose a comprehensive framework, Explainable Operations Research (EOR), emphasizing actionable and understandable explanations accompanying optimization. The core of EOR is the concept of Decision Information, which emerges from what-if analysis and focuses on evaluating how changes to complex constraints (or parameters) affect decision-making. Specifically, we utilize bipartite graphs to quantify the changes in the OR model and adopt LLMs to improve the explanation capabilities. Additionally, we introduce the first industrial benchmark to rigorously evaluate the effectiveness of explanations and analyses in OR, establishing a new standard for transparency and clarity in the field.

982Towards Infinite-Long Prefix in Transformer

[openreview] [pdf]

Abstract Prompting and context-based fine-tuning methods, which we call Prefix Learning, have been proposed to enhance the performance of language models on various downstream tasks. They are empirically efficient and effective, matching the performance of full parameter fine-tuning, but theoretical understanding of them is limited. In this paper, we aim to address this limitation by studying their ability from the perspective of prefix length. In particular, we provide a convergence guarantee for training an ultra-long prefix in a stylized setting using the Neural Tangent Kernel (NTK) framework. Based on this strong theoretical guarantee, we design and implement an algorithm that only needs to introduce and fine-tune a few extra trainable parameters, instead of an infinite-long prefix in each layer of a transformer, and can approximate the prefix attention to a guaranteed polynomially small error. Preliminary experimental results on vision, natural language, and math data show that our method achieves superior or competitive performance compared to existing methods such as full-parameter fine-tuning, P-Tuning V2, and LoRA. This demonstrates that our method is promising for parameter-efficient fine-tuning.

983Think Twice Before You Act: Improving Inverse Problem Solving With MCMC

[openreview] [pdf]

Abstract Recent studies demonstrate that diffusion models can serve as a strong prior for solving inverse problems. A prominent example is Diffusion Posterior Sampling (DPS), which approximates the posterior distribution of data given the measurement using Tweedie’s formula. Despite the merit of being versatile in solving various inverse problems without re-training, the performance of DPS is hindered by the fact that this posterior approximation can be inaccurate, especially at high noise levels. Therefore, we propose Diffusion Posterior MCMC (DPMC), a novel inference algorithm based on annealed MCMC to solve inverse problems with pretrained diffusion models. We define a series of intermediate distributions inspired by the approximated conditional distributions used by DPS. Through annealed MCMC sampling, we encourage the samples to follow each intermediate distribution more closely before moving to the next distribution at a lower noise level, and therefore reduce the accumulated error along the path. We test our algorithm on various inverse problems, including super-resolution, Gaussian deblurring, motion deblurring, inpainting, and phase retrieval. Our algorithm outperforms DPS with fewer function evaluations across nearly all tasks, and is competitive with existing approaches.
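
The annealed-MCMC idea can be illustrated with a toy Langevin sampler: at each noise level, several inner MCMC moves target an intermediate distribution before the noise is lowered. The quadratic prior score below is a stand-in for a pretrained diffusion model's score network, and the step-size schedule is an assumption.

```python
import numpy as np

def score(x, sigma, y, A):
    # Toy score of an intermediate posterior: Gaussian prior term plus a
    # measurement-matching term (stand-ins for the diffusion score and the
    # DPS-style conditional guidance, respectively).
    prior = -x / (1.0 + sigma ** 2)
    likelihood = A.T @ (y - A @ x) / 0.1
    return prior + likelihood

A = np.random.randn(4, 8)                    # toy linear measurement operator
x_true = np.random.randn(8)
y = A @ x_true                               # observed measurement
x = np.random.randn(8)                       # initialize from noise
for sigma in np.geomspace(10.0, 0.05, 30):   # annealing schedule, high -> low noise
    step = 0.5 * sigma ** 2 * 1e-2
    for _ in range(5):                       # inner Langevin moves per level
        noise = np.sqrt(2 * step) * np.random.randn(8)
        x = x + step * score(x, sigma, y, A) + noise
```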

984Grounding Video Models to Actions through Goal Conditioned Exploration

[openreview] [pdf]

Abstract Large video models, pretrained on massive quantities of Internet video, provide a rich source of physical knowledge about the dynamics and motions of objects and tasks. However, video models are not grounded in the embodiment of an agent, and do not describe how to actuate the world to reach the visual states depicted in a video. To tackle this problem, current methods use a separate vision-based inverse dynamics model trained on embodiment-specific data to map image states to actions. Gathering data to train such a model is often expensive and challenging, and this model is limited to visual settings similar to the ones in which data is available. In this paper, we investigate how to directly ground video models to continuous actions through self-exploration in the embodied environment, using generated video states as visual goals for exploration. We propose a framework that uses trajectory-level action generation in combination with video guidance to enable an agent to solve complex tasks without any external supervision, e.g., rewards, action labels, or segmentation masks. We validate the proposed approach on 8 tasks in Libero, 6 tasks in MetaWorld, 4 tasks in Calvin, and 12 tasks in iThor Visual Navigation. We show that our approach is on par with, or even surpasses, multiple behavior cloning baselines trained on expert demonstrations, while requiring no action annotations.

985Mitigating Overestimation in Offline Reinforcement Learning with Anomaly Detection

[openreview] [pdf]

Abstract Reinforcement Learning (RL) encounters substantial challenges in real-world applications, due to the time-consuming, costly, and risky nature of interacting with the environment. Offline Reinforcement Learning addresses this limitation by training models on static datasets, allowing an optimal policy to be learned from pre-collected data without requiring additional interactions with the environment. However, in this setting, querying actions outside the training data distribution can lead to overestimation of Q-values for out-of-distribution (OOD) actions, ultimately hindering policy optimization. Previous works attempted to address this problem using explicit constraints such as penalty terms or support restrictions, but these methods often fail to identify OOD actions or result in overly conservative Q-value estimates. We propose a novel solution that adjusts weights during training by using an anomaly detection model to identify the distribution of the offline dataset and employing anomaly scores to guide the offline RL process. Our method (RLAD) not only effectively mitigates the overestimation of OOD actions but also achieves near state-of-the-art performance on continuous D4RL tasks. Additionally, this framework is highly flexible, allowing for integration with various off-policy or offline RL algorithms and anomaly detection models to enhance performance.

986ENHANCING DIVERSITY AND ACCURACY IN PERSONALIZED TAG RECOMMENDATIONS: A HYBRID SEMANTIC AND CONTEXTUAL ANALYSIS APPROACH

[openreview] [pdf]

Abstract This paper introduces HYCOMB, a cascading Hybrid model that innovatively integrates Collaborative Filtering (CF), Content-Based Filtering (CB), and Context-Aware (CA) methods to address the challenge of data sparsity in tag recommendation systems. Unlike traditional models that rely heavily on user-item interactions, HYCOMB enhances recommendation diversity and interpretability by utilizing semantic clustering in CF to extract and analyze user sentiment from tags, adding a layer of nuanced understanding often missing in conventional systems. The CB component advances this by applying sophisticated NLP techniques to refine these recommendations based on item attributes, while the CA component incorporates movie synopses for deeper contextual understanding. Developed and tested on the MovieLens 20M dataset, our model significantly outperforms baseline methods in terms of precision and recall, achieving scores of 0.813 and 0.364 respectively. Further, a newly introduced Overall Total Similarity metric underscores its ability to deliver relevant and diverse recommendations. HYCOMB’s strategic amalgamation of CF, CB, and CA not only mitigates the effects of sparse data but also improves the precision and diversity of tag recommendations, reflecting a more accurate alignment with user preferences.

987FEDERATED COMPOSITIONAL OPTIMIZATION: THE IMPACT OF TWO-SIDED LEARNING RATES ON COMMUNICATION EFFICIENCY

[openreview] [pdf]

Abstract Compositional optimization (CO) has recently gained popularity due to its applications in distributionally robust optimization (DRO), meta-learning, reinforcement learning, and many other machine learning applications. The large-scale and distributed nature of data necessitates efficient federated learning (FL) algorithms for CO, but the compositional structure of the objective poses significant challenges. Current methods either rely on large batch gradients (which are impractical) or suffer from suboptimal communication efficiency. To address these challenges, we propose efficient FedAvg-type algorithms for solving non-convex CO in the FL setting. We first establish that standard FedAvg fails to solve the federated CO problem due to data heterogeneity, which amplifies bias in local gradient estimates. Our analysis establishes that either additional communication or two-sided learning-rate-based algorithms are required to control this bias. To this end, we develop two algorithms for solving the federated CO problem. First, we propose FedDRO, which utilizes the compositional problem structure to design a communication strategy that allows FedAvg to control the bias in the estimation of the compositional gradient, achieving \mathcal{O}(\epsilon^{-2}) sample and \mathcal{O}(\epsilon^{-3/2}) communication complexity. Then we propose DS-FedDRO, a two-sided learning-rate algorithm that eliminates the need for additional communication and achieves the optimal \mathcal{O}(\epsilon^{-2}) sample and \mathcal{O}(\epsilon^{-1}) communication complexity, highlighting the importance of two-sided learning-rate algorithms for solving federated CO problems. The proposed algorithms avoid the need for large batch gradients and achieve linear speedup with the number of clients. We corroborate our theoretical findings with empirical studies on large-scale DRO problems.

988Policy Decorator: Model-Agnostic Online Refinement for Large Policy Model

[openreview] [pdf]

Abstract Recent advancements in robot learning have used imitation learning with large models and extensive demonstrations to develop effective policies. However, these models are often limited by the quantity, quality, and diversity of demonstrations. This paper explores improving offline-trained imitation learning models through online interactions with the environment. We introduce Policy Decorator, which uses a model-agnostic residual policy to refine large imitation learning models during online interactions. By implementing controlled exploration strategies, Policy Decorator enables stable, sample-efficient online learning. Our evaluation spans eight tasks across two benchmarks (ManiSkill and Adroit) and involves two state-of-the-art imitation learning models (Behavior Transformer and Diffusion Policy). The results show that Policy Decorator effectively improves the offline-trained policies and preserves the smooth motion of imitation learning models, avoiding the erratic behaviors of pure RL policies. See our project page for videos.
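
The residual-policy construction is simple to sketch: a frozen base imitation model proposes an action and a small trainable head adds a bounded correction. The tanh bounding and scale below are assumptions standing in for the paper's controlled exploration strategies, which the abstract does not detail.

```python
import torch
import torch.nn as nn

class ResidualWrapper(nn.Module):
    def __init__(self, base_policy, obs_dim, act_dim, alpha=0.1):
        super().__init__()
        self.base = base_policy                 # frozen large imitation policy
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.residual = nn.Sequential(          # small trainable correction head
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
        self.alpha = alpha                      # bound on the residual's magnitude

    def forward(self, obs):
        with torch.no_grad():
            a_base = self.base(obs)             # base action, never updated
        # Bounded additive correction keeps the refined policy close to the
        # smooth base behavior while allowing online improvement.
        return a_base + self.alpha * torch.tanh(self.residual(obs))

policy = ResidualWrapper(nn.Linear(17, 6), obs_dim=17, act_dim=6)
action = policy(torch.randn(1, 17))             # base action plus learned residual
```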

989Balanced Hyperbolic Embeddings Are Natural Out-of-Distribution Detectors

[openreview] [pdf]

Abstract Out-of-distribution recognition forms an important and well-studied problem in computer vision, with the goal to filter out samples that do not belong to the distribution on which a network has been trained. The conclusion of this paper is simple: a good hierarchical hyperbolic embedding is preferred for discriminating in- and out-of-distribution samples. We introduce Balanced Hyperbolic Learning. We outline a hyperbolic class embedding algorithm that jointly optimizes for hierarchical distortion and balancing between shallow and wide subhierarchies. We can then use the class embeddings as hyperbolic prototypes for classification on in-distribution data. We outline how existing out-of-distribution scoring functions can be generalized to operate with hyperbolic prototypes. Empirical evaluations across 13 datasets and 13 scoring functions show that our hyperbolic embeddings outperform existing out-of-distribution approaches when trained on the same data with the same backbones. We also show that our hyperbolic embeddings outperform other hyperbolic approaches and naturally enable hierarchical out-of-distribution generalization.
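
The basic operation behind hyperbolic prototype classification is the distance on the Poincaré ball. A minimal sketch follows, with a nearest-prototype score as one simple instance of the hyperbolic OOD scoring functions the abstract generalizes; the prototype placement here is random and purely illustrative.

```python
import torch
import torch.nn.functional as F

def poincare_dist(x, y, eps=1e-5):
    # Distance on the Poincaré ball: acosh(1 + 2|x-y|^2 / ((1-|x|^2)(1-|y|^2))).
    sq = ((x - y) ** 2).sum(-1)
    nx = (1 - (x ** 2).sum(-1)).clamp_min(eps)
    ny = (1 - (y ** 2).sum(-1)).clamp_min(eps)
    return torch.acosh(1 + 2 * sq / (nx * ny))

protos = 0.9 * F.normalize(torch.randn(10, 16), dim=-1)  # class prototypes near boundary
z = 0.5 * F.normalize(torch.randn(4, 16), dim=-1)        # embedded test samples
d = poincare_dist(z.unsqueeze(1), protos.unsqueeze(0))   # (4, 10) pairwise distances
ood_score = -d.min(dim=1).values   # farther from every prototype => more OOD-like
```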

990Understanding Synthetic Context Extension via Retrieval Heads

[openreview] [pdf]

Abstract Long-context LLMs are increasingly desired for a broad set of applications such as retrieval-augmented generation. The high cost for pretraining LLMs over long contexts has led to exploration of fine-tuning LLMs with synthetically generated data in a post-training stage. However, it remains unclear how and why fine-tuning on synthetic data transfers to long-context performance on realistic tasks. In this paper, we investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We explore synthetic data variants from the literature by varying the realism of the concept expression and context diversity of the data. We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be interpreted and even predicted in terms of a special set of attention heads that are responsible for retrieval over long context, retrieval heads (Wu et al., 2024). The retrieval heads learned on synthetic data are mostly subsets of the retrieval heads learned on real data, and there is a strong correlation between the recall of heads learned and the downstream performance of a model. Furthermore, with attention knockout and activation patching, we mechanistically show that retrieval heads are not only necessary, but also provide fine-grained explanations for the performance gap between fine-tuning on synthetic and real data. Our results shed light on how to interpret the success and failure of synthetic data fine-tuning and how to create better synthetic data that can be transferred to realistic capabilities over long context.

991The Vital Role of Gradient Clipping in Byzantine-Resilient Distributed Learning

[openreview] [pdf]

Abstract Byzantine-resilient distributed machine learning seeks to achieve robust learning performance in the presence of misbehaving or adversarial workers. While state-of-the-art (SOTA) robust distributed gradient descent (Robust-DGD) methods were proven theoretically optimal, their empirical success has often relied on pre-aggregation gradient clipping. However, the currently considered static clipping strategy exhibits mixed results: improving robustness against some attacks while being ineffective or detrimental against others. We address this gap by proposing a principled adaptive clipping strategy, termed Adaptive Robust Clipping (ARC). We show that ARC consistently enhances the empirical robustness of SOTA Robust-DGD methods, while preserving the theoretical robustness guarantees. Our analysis shows that ARC provably improves the asymptotic convergence guarantee of Robust-DGD in the case when the model is well-initialized. We validate this theoretical insight through an exhaustive set of experiments on benchmark image classification tasks. We observe that the improvement induced by ARC is more pronounced in highly heterogeneous and adversarial settings.
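
The abstract does not give ARC's exact rule, so the following is only one plausible reading of adaptive pre-aggregation clipping: the largest gradient norms are clipped down to a threshold set by the rest of the population, with the cutoff tied to the tolerated number of Byzantine workers f. Both the choice of k and the threshold are assumptions.

```python
import numpy as np

def adaptive_clip(grads, f):
    # Clip the k largest-norm gradients to the (k+1)-th largest norm, so the
    # threshold adapts to the current population instead of being static.
    norms = np.linalg.norm(grads, axis=1)
    k = min(2 * f, len(grads) - 1)              # assumed choice of k
    tau = np.sort(norms)[-(k + 1)]              # (k+1)-th largest norm
    scale = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
    return grads * scale[:, None]

grads = np.random.randn(10, 5)                  # one gradient per worker
grads[0] *= 100.0                               # a Byzantine outlier
clipped = adaptive_clip(grads, f=1)
robust_mean = np.mean(clipped, axis=0)          # then feed into a robust aggregator
```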

992Inference-Time Alignment of Diffusion Models with Direct Noise Optimization

[openreview] [pdf]

Abstract In this work, we focus on the alignment problem of diffusion models with a continuous reward function, which represents specific objectives for downstream tasks, such as increasing darkness or improving the aesthetics of images. The central goal of the alignment problem is to adjust the distribution learned by diffusion models such that the generated samples maximize the target reward function. We propose a novel alignment approach, named Direct Noise Optimization (DNO), that optimizes the injected noise during the sampling process of diffusion models. By design, DNO operates at inference time, and thus is tuning-free and prompt-agnostic, with the alignment occurring in an online fashion during generation. We rigorously study the theoretical properties of DNO and also propose variants to deal with non-differentiable reward functions. Furthermore, we identify that naive implementations of DNO occasionally suffer from the out-of-distribution reward-hacking problem, where optimized samples have high rewards but are no longer in the support of the pretrained distribution. To remedy this issue, we leverage classical high-dimensional statistics theory to design an effective probability regularization technique. We conduct extensive experiments on several important reward functions and demonstrate that the proposed DNO approach can achieve state-of-the-art reward scores within a reasonable time budget for generation.
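
The core loop is direct: treat the injected noise as the optimization variable, backpropagate a reward through a differentiable sampler, and regularize the noise to stay typical under the Gaussian prior. In this toy sketch the one-step sampler, the reward, and the norm-based regularizer are all illustrative assumptions rather than the paper's actual components.

```python
import torch

d = 256
sampler = lambda z: torch.tanh(z)       # dummy differentiable sampler stand-in
reward = lambda x: x.mean()             # toy reward, e.g. "increase brightness"

z = torch.randn(d, requires_grad=True)  # injected noise is the variable
opt = torch.optim.Adam([z], lr=1e-2)
for _ in range(300):
    opt.zero_grad()
    x = sampler(z)
    # Keep ||z|| near sqrt(d), where Gaussian mass concentrates, so the noise
    # stays plausible and the sample stays in-distribution (anti reward hacking).
    reg = (z.norm() - d ** 0.5) ** 2
    loss = -reward(x) + 1e-3 * reg
    loss.backward()
    opt.step()
```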

993On Learning Representations for Tabular Dataset Distillation

[openreview] [pdf]

Abstract Dataset distillation generates a small set of information-rich instances from a large dataset, resulting in reduced storage requirements, privacy or copyright risks, and computational costs for downstream modeling, though much of the research has focused on the image data modality. We study tabular data distillation, which brings in novel challenges such as the inherent feature heterogeneity and the common use of non-differentiable learning models (such as decision tree ensembles and nearest-neighbor predictors). To mitigate these challenges, we present TDColER, a tabular data distillation framework via column embeddings-based representation learning. To evaluate this framework, we also present a tabular data distillation benchmark, TDBench. Based on an elaborate evaluation on TDBench, resulting in 226,200 distilled datasets and 541,980 models trained on them, we demonstrate that TDColER is able to boost the distilled data quality of off-the-shelf distillation schemes by 0.5-143% across 7 different tabular learning models.

994Decoupling Backdoors from Main Task: Toward the Effective and Durable Backdoors in Federated Learning

[openreview] [pdf]

Abstract Federated learning, as a distributed machine learning method, enables multiple participants to collaboratively train a central model without sharing their private data. However, this decentralized mechanism introduces new privacy and security concerns. Malicious attackers can embed backdoors into local models, which are inherited by the central global model through the federated aggregation process. While previous studies have demonstrated the effectiveness of backdoor attacks, their effectiveness and durability often rely on unrealistic assumptions, such as a large number of attackers and scaled malicious contributions. These assumptions arise because a sufficient number of attackers can neutralize the contributions of honest participants, allowing the backdoor to be successfully inherited by the central model. In this work, we attribute these limitations to the coupling between the main and backdoor tasks. To address them, we propose a min-max backdoor attack framework that decouples backdoors from the main task, ensuring that the two tasks do not interfere with each other. The maximization phase employs the principle of universal adversarial perturbation to create triggers that amplify the performance disparity between poisoned and benign samples. These samples are then used to train a backdoor model in the minimization phase. We evaluate the proposed framework on both image classification and semantic analysis tasks. Comparisons with four backdoor attack methods under five defense algorithms show that our method achieves good attack performance even with a small number of attackers and without scaling the submitted model parameters. In addition, even if attackers are removed entirely from the training process, the implanted backdoors are not dramatically weakened by the contributions of other honest participants.

995Divergence-enhanced Knowledge-guided Context Optimization for Visual-Language Prompt Tuning

[openreview] [pdf]

Abstract Prompt tuning vision-language models like CLIP has shown great potential in learning transferable representations for various downstream tasks. The main issue is how to mitigate the over-fitting problem on downstream tasks with limited training samples. While knowledge-guided context optimization (Yao et al., 2023; 2024) has been proposed to handle catastrophic forgetting in the pre-trained backbone by constructing consistency constraints, it also introduces a potential bias toward pre-training. This paper proposes a novel and simple Divergence-enhanced Knowledge-guided Prompt Tuning (DeKg) method to address this issue. The key insight is that the bias toward pre-training can be alleviated by encouraging independence between the learnable and the crafted prompt. Specifically, DeKg employs the Hilbert-Schmidt Independence Criterion (HSIC) to regularize the learnable prompts, thereby reducing their dependence on prior general knowledge and enabling divergence induced by target knowledge. Comprehensive evaluations demonstrate that DeKg serves as a plug-and-play module that can seamlessly integrate with existing knowledge-guided methods and achieves superior performance on three challenging benchmarks.
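
HSIC itself has a short empirical form: the biased estimator built from two kernel matrices and a centering matrix. A sketch of the regularizer is below; the prompt feature tensors, kernel bandwidth, and how the penalty is weighted against the task loss are assumptions, since the abstract does not specify them.

```python
import torch

def gaussian_kernel(x, sigma=1.0):
    d2 = torch.cdist(x, x) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    # Biased empirical HSIC: trace(K H L H) / (n - 1)^2, with H the centering matrix.
    n = x.size(0)
    K, L = gaussian_kernel(x, sigma), gaussian_kernel(y, sigma)
    H = torch.eye(n) - torch.ones(n, n) / n
    return torch.trace(K @ H @ L @ H) / (n - 1) ** 2

learnable = torch.randn(16, 512, requires_grad=True)  # learnable prompt features
crafted = torch.randn(16, 512)                        # fixed hand-crafted prompt features
penalty = hsic(learnable, crafted)   # add to the task loss with a small weight
penalty.backward()                   # lowering HSIC encourages independence
```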

996Editable Concept Bottleneck Models

[openreview] [pdf]

Abstract Concept Bottleneck Models (CBMs) have garnered much attention for their ability to elucidate the prediction process through a human-understandable concept layer. However, most previous studies focused on cases where the data, including concepts, are clean. In many scenarios, we often need to remove training data or insert new concepts into trained CBMs for various reasons, such as privacy concerns, data mislabelling, spurious concepts, and concept annotation errors. Thus, the challenge of deriving efficient editable CBMs without retraining from scratch persists, particularly in large-scale applications. To address these challenges, we propose Editable Concept Bottleneck Models (ECBMs). Specifically, ECBMs support three different levels of data removal: concept-label-level, concept-level, and data-level. ECBMs enjoy mathematically rigorous closed-form approximations derived from influence functions that obviate the need for retraining. Experimental results demonstrate the efficiency and effectiveness of our ECBMs, affirming their adaptability within the realm of CBMs.
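
The influence-function style of closed-form editing can be illustrated on a ridge-regularized linear model, where removing one training point costs a single Hessian solve rather than retraining. The CBM-specific concept structure is omitted, so this is only a toy analogue of the paper's approximations.

```python
import numpy as np

def remove_point(theta, X, y, i, lam=1e-2):
    # First-order influence approximation for removing training point i:
    # theta' ~ theta + (1/n) H^{-1} grad_loss_i(theta), with H the Hessian
    # of the ridge-regularized average squared loss.
    n, d = X.shape
    H = X.T @ X / n + lam * np.eye(d)
    g_i = X[i][:, None] * (X[i] @ theta - y[i])   # gradient of the removed point
    return theta + np.linalg.solve(H, g_i.ravel()) / n

X = np.random.randn(100, 5)
y = X @ np.ones(5) + 0.1 * np.random.randn(100)
theta = np.linalg.solve(X.T @ X / 100 + 1e-2 * np.eye(5), X.T @ y / 100)
theta_edit = remove_point(theta, X, y, i=0)       # edited parameters, no retraining
```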

997Diffusion Attacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak

[openreview] [pdf]

Abstract Large Language Models can generate harmful content when prompted with carefully crafted inputs, a vulnerability known as LLM jailbreaking. As LLMs become more powerful, studying jailbreaking becomes a critical aspect of enhancing security and human value alignment. Currently, jailbreaking is usually implemented by adding suffixes or using prompt templates, approaches that suffer from low attack diversity. Inspired by diffusion models, this paper introduces DiffusionAttacker, an end-to-end generative method for jailbreak rewriting. Our approach employs a seq2seq text diffusion model as a generator, conditioning on the original prompt and guiding the denoising process with a novel attack loss. This method preserves the semantic content of the original prompt while producing harmful content. Additionally, we leverage the Gumbel-Softmax technique to make the sampling process from the diffusion model’s output distribution differentiable, thereby eliminating the need for an iterative token search. Through extensive experiments on AdvBench and HarmBench, we show that DiffusionAttacker outperforms previous methods on various evaluation metrics, including attack success rate (ASR), fluency, and diversity.
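
The Gumbel-Softmax step is worth making concrete: it replaces hard token sampling with a differentiable straight-through relaxation, so a loss computed downstream can reach the generator's logits. The vocabulary size, embedding table, and stand-in loss below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 20, 32000, requires_grad=True)  # (batch, seq_len, vocab)
# hard=True yields one-hot samples in the forward pass but uses the soft
# relaxation's gradient in the backward pass (straight-through estimator).
one_hot_tokens = F.gumbel_softmax(logits, tau=0.5, hard=True)
embedding = torch.nn.Embedding(32000, 768)
token_embs = one_hot_tokens @ embedding.weight           # differentiable "lookup"
attack_loss = token_embs.pow(2).mean()                   # stand-in for the attack loss
attack_loss.backward()                                   # gradients flow to `logits`
```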

998Bad-PFL: Exploiting Backdoor Attacks against Personalized Federated Learning

[openreview] [pdf]

Abstract Data heterogeneity and backdoor attacks rank among the most significant challenges facing federated learning (FL). For data heterogeneity, personalized federated learning (PFL) enables each client to maintain a private personalized model that caters to client-specific knowledge. Meanwhile, vanilla FL has proven vulnerable to backdoor attacks. However, recent advancements in the PFL community have demonstrated a potential immunity against such attacks. This paper explores this intersection further, revealing that existing federated backdoor attacks fail in PFL because backdoors based on manually designed triggers struggle to survive in personalized models. To tackle this, we design Bad-PFL, which employs features from natural data as its trigger. As long as the model is trained on natural data, it inevitably embeds the backdoor associated with our trigger, ensuring its longevity in personalized models. Moreover, our trigger undergoes mutual reinforcement training with the model, further solidifying the backdoor’s durability and enhancing attack effectiveness. Large-scale experiments across three benchmark datasets demonstrate the superior performance of our attack against various PFL methods, even when they are equipped with state-of-the-art defense mechanisms.

999Efficient Predictive Counterfactual Regret Minimization+ Algorithm in Solving Extensive-Form Games

[openreview] [pdf]

Abstract Imperfect-information extensive-form games (IIGs) serve as a foundational model for capturing interactions among multiple agents in sequential settings with hidden information. A common objective of IIGs is to calculate a Nash equilibrium (NE). Counterfactual Regret Minimization (CFR) algorithms have been widely developed to learn an NE in two-player zero-sum IIGs. Among CFR algorithms, Predictive CFR+ (PCFR+) is powerful, usually achieving an extremely fast empirical convergence rate. However, PCFR+ suffers from the significant discrepancy between strategies represented by explicit accumulated counterfactual regrets across two consecutive iterations, which decreases the empirical convergence rate of PCFR+ in practice. To mitigate this significant discrepancy, we introduce a novel and effective variant of PCFR+, termed Pessimistic PCFR+ (P2PCFR+), minimizing the discrepancy between strategies represented by implicit and explicit accumulated regrets within the same iteration. We provide theoretical proof to show that P2PCFR+ exhibits a faster theoretical convergence rate than PCFR+. Experimental results demonstrate that P2PCFR+ outperforms other tested CFR variants.
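
For context, the predictive regret-matching+ update at the heart of PCFR+ can be written in a few lines for a two-action matrix game; the pessimistic modification that P2PCFR+ introduces is not reproduced here, so this sketch covers only the baseline it improves on.

```python
import numpy as np

def rm_plus_strategy(r):
    # Regret matching+: play proportionally to positive (predicted) regrets.
    pos = np.maximum(r, 0.0)
    s = pos.sum()
    return pos / s if s > 0 else np.full_like(r, 1.0 / len(r))

A = np.array([[1.0, -1.0], [-1.0, 1.0]])     # matching pennies payoffs
Rx = np.zeros(2); Ry = np.zeros(2)           # accumulated regrets (RM+ truncated)
mx = np.zeros(2); my = np.zeros(2)           # predictions of the next regret
for t in range(10000):
    x = rm_plus_strategy(Rx + mx)            # predictive step: act on R + prediction
    y = rm_plus_strategy(Ry + my)
    ux, uy = A @ y, -A.T @ x                 # per-action utilities for each player
    rx, ry = ux - x @ ux, uy - y @ uy        # instantaneous regrets
    Rx = np.maximum(Rx + rx, 0.0)            # RM+ keeps accumulated regrets nonnegative
    Ry = np.maximum(Ry + ry, 0.0)
    mx, my = rx, ry                          # predict next regret with the last one
```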

1000Intervening Anchor Token: Decoding Strategy in Alleviating Hallucinations for MLLMs

[openreview] [pdf]

Abstract Multimodal large language models (MLLMs) offer a powerful mechanism for interpreting visual information. However, they often suffer from hallucinations, which impede the real-world usage of these models. Existing methods attempt to alleviate this issue by designing special decoding strategies that penalize the summary tokens. However, these methods lack analysis of the relationship between hallucination and the summarization mechanism of LLMs. Interestingly, we find that penalizing summary tokens is not necessary: merely intervening on the variance of the query-key parameters, without costing extra inference time, still alleviates hallucinations. Specifically, we explore the causes of hallucinations by analyzing localized self-attention patterns called "anchor" tokens and define the attention localization degree of the model as token propagation probabilities. Our analysis reveals that over-propagation of anchor tokens occurs when the distribution of eigenvalues of the query and key matrices has a non-zero mean and a polarized variance, leading to excessive dependence on anchor tokens while neglecting visual information, which causes the model to describe image content with hallucinations. Based on this observation, we propose a versatile plug-and-play decoding strategy, the Dynamic Token Propagation Mechanism (TAME), to alleviate excessive propagation by dynamically intervening on the eigenspectrum variance of the attention weights, thereby alleviating hallucinations without relying on complex decoding strategies. Extensive experiments reveal a correlation between the eigenspectrum and hallucinations across various MLLMs, and show that TAME reduces the percentage of hallucinated objects.