Generated at 2025-06-25 05:03:26
We have 142 news items from different sources.
2feed¶
2.1Cache Me If You Can: How Danqi Chen's Team "Catches" the Critical Cache to Free Up LLM Memory¶
2025/06/24 14:08 GMT
This metric reveals the high peak-memory problem of previous KV eviction methods.
2.2ToMAP: Giving Large Models "Mind-Reading" Powers to Build Smarter AI Persuaders¶
2025/06/24 14:08 GMT
Enabling AI to put itself in the other party's shoes and think from their perspective, achieving a more personalized, flexible, and logical persuasion process.
2.3It Can Teach Lessons, Predict Exam Questions, and Tailor Study Plans: The Only One That Puts Its Effort Where It Counts¶
2025/06/24 14:08 GMT
What defeats children is not AI, but the children who are skilled at using AI.
3paper¶
3.1SceneCrafter: Controllable Multi-View Driving Scene Editing¶
2025/06/24 10:23 GMT
Simulation is crucial for developing and evaluating autonomous vehicle (AV) systems. Recent literature builds on a new generation of generative models to synthesize highly realistic images for full-stack simulation. However, purely synthetically generated scenes are not grounded in reality and have difficulty inspiring confidence in the relevance of their outcomes. Editing models, on the other hand, leverage source scenes from real driving logs, and enable the simulation of different traffic layouts, behaviors, and operating conditions such as weather and time of day. While image editing is an established topic in computer vision, it presents fresh sets of challenges in driving simulation: (1) the need for cross-camera 3D consistency, (2) learning "empty street" priors from driving data with foreground occlusions, and (3) obtaining paired image tuples of varied editing conditions while preserving consistent layout and geometry. To address these challenges, we propose SceneCrafter, a versatile editor for realistic 3D-consistent manipulation of driving scenes captured from multiple cameras. We build on recent advancements in multi-view diffusion models, using a fully controllable framework that scales seamlessly to multi-modality conditions like weather, time of day, agent boxes and high-definition maps. To generate paired data for supervising the editing model, we propose a novel framework on top of Prompt-to-Prompt to generate geometrically consistent synthetic paired data with global edits. We also introduce an alpha-blending framework to synthesize data with local edits, leveraging a model trained on empty street priors through a novel masked training and multi-view repaint paradigm. SceneCrafter demonstrates powerful editing capabilities and achieves state-of-the-art realism, controllability, 3D consistency, and scene editing quality compared to existing baselines.
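As a rough illustration of the local-edit compositing idea mentioned above, a minimal alpha-blending sketch follows; the function name, feathering, and blending form are assumptions for illustration, not SceneCrafter's actual implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def alpha_blend_local_edit(source_rgb, edited_rgb, mask, feather_sigma=9.0):
    """Composite a locally edited region back onto the source frame.

    source_rgb, edited_rgb: float arrays in [0, 1] with shape (H, W, 3)
    mask: binary array of shape (H, W), 1 where the local edit applies
    feather_sigma: width of the soft transition band (illustrative hyperparameter)
    """
    # Soften the binary mask so the edited region blends smoothly into the source image.
    alpha = gaussian_filter(mask.astype(np.float32), sigma=feather_sigma)
    alpha = np.clip(alpha, 0.0, 1.0)[..., None]
    return alpha * edited_rgb + (1.0 - alpha) * source_rgb
```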
3.2AnimaX: Animating the Inanimate in 3D with Joint Video-Pose Diffusion Models¶
2025/06/24 17:59 GMT
We present AnimaX, a feed-forward 3D animation framework that bridges the
motion priors of video diffusion models with the controllable structure of
skeleton-based animation. Traditional motion synthesis methods are either
restricted to fixed skeletal topologies or require costly optimization in
high-dimensional deformation spaces. In contrast, AnimaX effectively transfers
video-based motion knowledge to the 3D domain, supporting diverse articulated
meshes with arbitrary skeletons. Our method represents 3D motion as multi-view,
multi-frame 2D pose maps, and enables joint video-pose diffusion conditioned on
template renderings and a textual motion prompt. We introduce shared positional
encodings and modality-aware embeddings to ensure spatial-temporal alignment
between video and pose sequences, effectively transferring video priors to
the motion generation task. The resulting multi-view pose sequences are
triangulated into 3D joint positions and converted into mesh animation via
inverse kinematics. Trained on a newly curated dataset of 160,000 rigged
sequences, AnimaX achieves state-of-the-art results on VBench in
generalization, motion fidelity, and efficiency, offering a scalable solution
for category-agnostic 3D animation. Project page:
https://
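The triangulation step described in the abstract (multi-view 2D pose maps to 3D joint positions) can be sketched with a standard direct linear transform (DLT); the interfaces below are illustrative assumptions, not AnimaX's code.

```python
import numpy as np

def triangulate_joint(proj_mats, points_2d):
    """Direct Linear Transform triangulation of one joint.

    proj_mats: list of 3x4 camera projection matrices, one per view
    points_2d: list of (x, y) pixel coordinates of the same joint in each view
    Returns the 3D joint position as a length-3 array.
    """
    rows = []
    for P, (x, y) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on the homogeneous 3D point.
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    # Least-squares solution: right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```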
3.3Training-Free Motion Customization for Distilled Video Generators with Adaptive Test-Time Distillation¶
2025/06/24 06:20 GMT
Distilled video generation models offer fast and efficient synthesis but
struggle with motion customization when guided by reference videos, especially
under training-free settings. Existing training-free methods, originally
designed for standard diffusion models, fail to generalize due to the
accelerated generative process and large denoising steps in distilled models.
To address this, we propose MotionEcho, a novel training-free test-time
distillation framework that enables motion customization by leveraging
diffusion teacher forcing. Our approach uses high-quality, slow teacher models
to guide the inference of fast student models through endpoint prediction and
interpolation. To maintain efficiency, we dynamically allocate computation
across timesteps according to guidance needs. Extensive experiments across
various distilled video generation models and benchmark datasets demonstrate
that our method significantly improves motion fidelity and generation quality
while preserving high efficiency. Project page:
https://
3.4GenHSI: Controllable Generation of Human-Scene Interaction Videos¶
2025/06/24 17:58 GMT
Large-scale pre-trained video diffusion models have exhibited remarkable
capabilities in diverse video generation. However, existing solutions face
several challenges when using these models to generate long, movie-like videos
with rich human-object interactions: unrealistic human-scene interaction, lack
of subject identity preservation, and expensive training. We propose GenHSI, a
training-free method for controllable generation
of long human-scene interaction videos (HSI). Taking inspiration from movie
animation, our key insight is to overcome the limitations of previous work by
subdividing the long video generation task into three stages: (1) script
writing, (2) pre-visualization, and (3) animation. Given an image of a scene, a
user description, and multiple images of a person, we use these three stages to
generate long videos that preserve human identity and provide rich human-scene
interactions. Script writing converts complex human tasks into simple atomic
tasks that are used in the pre-visualization stage to generate 3D keyframes
(storyboards). These 3D keyframes are rendered and animated by off-the-shelf
video diffusion models for consistent long video generation with rich contacts
in a 3D-aware manner. A key advantage of our work is that we alleviate the need
for scanned, accurate scenes and create 3D keyframes from single-view images.
We are the first to generate a long video sequence with a consistent camera
pose that contains arbitrary numbers of character actions without training.
Experiments demonstrate that our method can generate long videos that
effectively preserve scene content and character identity with plausible
human-scene interaction from a single image scene. Visit our project homepage
https://
3.5CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation¶
2025/06/24 17:30 GMT
Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong generalization across manipulation tasks. However, they remain constrained by a single-frame observation paradigm and cannot fully benefit from the motion information offered by aggregated multi-frame historical observations, as the large vision-language backbone introduces substantial computational cost and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm through an efficient post-training stage. CronusVLA comprises three key components: (1) single-frame pretraining on large-scale embodied datasets with autoregressive action token prediction, which establishes an embodied vision-language foundation; (2) multi-frame encoding, adapting the prediction of vision-language backbones from discrete action tokens to motion features during post-training, and aggregating motion features from historical frames into a feature chunk; (3) cross-frame decoding, which maps the feature chunk to accurate actions via a shared decoder with cross-attention. By reducing redundant token computation and caching past motion features, CronusVLA achieves efficient inference. As an application of motion features, we further propose an action adaptation mechanism based on feature-action retrieval to improve model performance during finetuning. CronusVLA achieves state-of-the-art performance on SimplerEnv with a 70.9% success rate, and a 12.7% improvement over OpenVLA on LIBERO. Real-world Franka experiments also demonstrate strong performance and robustness.
3.7Deblurring in the Wild: A Real-World Dataset from Smartphone High-Speed Videos¶
2025/06/24 09:17 GMT
We introduce the largest real-world image deblurring dataset constructed from smartphone slow-motion videos. Using 240 frames captured over one second, we simulate realistic long-exposure blur by averaging frames to produce blurry images, while using the temporally centered frame as the sharp reference. Our dataset contains over 42,000 high-resolution blur-sharp image pairs, making it approximately 10 times larger than widely used datasets, with 8 times as many distinct scenes spanning indoor and outdoor environments with varying object and camera motions. We benchmark multiple state-of-the-art (SOTA) deblurring models on our dataset and observe significant performance degradation, highlighting the complexity and diversity of our benchmark. Our dataset serves as a challenging new benchmark to facilitate robust and generalizable deblurring models.
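The blur-synthesis recipe is simple enough to sketch directly: average the frames from one second of 240 fps video and keep the temporally centered frame as the sharp reference. Names and data handling below are assumptions, not the dataset's released tooling.

```python
import numpy as np

def make_blur_sharp_pair(frames):
    """frames: sequence of aligned uint8 frames (H, W, 3) from one second of 240 fps video."""
    stack = np.stack(frames).astype(np.float32)
    blurry = stack.mean(axis=0)            # simulated long-exposure blur
    sharp = stack[len(frames) // 2]        # temporally centered frame as the sharp reference
    return blurry.astype(np.uint8), sharp.astype(np.uint8)
```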
3.8SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution¶
2025/06/24 17:57 GMT
Latent diffusion models have emerged as a leading paradigm for efficient video generation. However, as user expectations shift toward higher-resolution outputs, relying solely on latent computation becomes inadequate. A promising approach involves decoupling the process into two stages: semantic content generation and detail synthesis. The former employs a computationally intensive base model at lower resolutions, while the latter leverages a lightweight cascaded video super-resolution (VSR) model to achieve high-resolution output. In this work, we focus on studying key design principles for the latter cascaded VSR models, which are currently underexplored. First, we propose two degradation strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator. Second, we provide critical insights into VSR model behavior through systematic analysis of (1) timestep sampling strategies and (2) noise augmentation effects on low-resolution (LR) inputs. These findings directly inform our architectural and training innovations. Finally, we introduce an interleaving temporal unit and sparse local attention to achieve efficient training and inference, drastically reducing computational overhead. Extensive experiments demonstrate the superiority of our framework over existing methods, with ablation studies confirming the efficacy of each design choice. Our work establishes a simple yet effective baseline for cascaded video super-resolution generation, offering practical insights to guide future advancements in efficient cascaded synthesis systems.
3.9Trajectory Prediction in Dynamic Object Tracking: A Critical Study¶
2025/06/24 06:10 GMT
This study provides a detailed analysis of current advancements in dynamic object tracking (DOT) and trajectory prediction (TP) methodologies, including their applications and challenges. It covers various approaches, such as feature-based, segmentation-based, estimation-based, and learning-based methods, evaluating their effectiveness, deployment, and limitations in real-world scenarios. The study highlights the significant impact of these technologies in automotive and autonomous vehicles, surveillance and security, healthcare, and industrial automation, contributing to safety and efficiency. Despite the progress, challenges such as improved generalization, computational efficiency, reduced data dependency, and ethical considerations still exist. The study suggests future research directions to address these challenges, emphasizing the importance of multimodal data integration, semantic information fusion, and developing context-aware systems, along with ethical and privacy-preserving frameworks.
3.10Active View Selector: Fast and Accurate Active View Selection with Cross Reference Image Quality Assessment¶
2025/06/24 17:59 GMT
We tackle active view selection in novel view synthesis and 3D reconstruction. Existing methods like FisheRF and ActiveNeRF select the next best view by minimizing uncertainty or maximizing information gain in 3D, but they require specialized designs for different 3D representations and involve complex modelling in 3D space. Instead, we reframe this as a 2D image quality assessment (IQA) task, selecting views where current renderings have the lowest quality. Since ground-truth images for candidate views are unavailable, full-reference metrics like PSNR and SSIM are inapplicable, while no-reference metrics, such as MUSIQ and MANIQA, lack the essential multi-view context. Inspired by a recent cross-referencing quality framework CrossScore, we train a model to predict SSIM within a multi-view setup and use it to guide view selection. Our cross-reference IQA framework achieves substantial quantitative and qualitative improvements across standard benchmarks, while being agnostic to 3D representations, and runs 14-33 times faster than previous methods.
3.11ConCM: Consistency-Driven Calibration and Matching for Few-Shot Class-Incremental Learning¶
2025/06/24 12:12 GMT
Few-Shot Class-Incremental Learning (FSCIL) requires models to adapt to novel classes with limited supervision while preserving learned knowledge. Existing prospective learning-based space construction methods reserve space to accommodate novel classes. However, prototype deviation and structure fixity limit the expressiveness of the embedding space. In contrast to fixed space reservation, we explore the optimization of feature-structure dual consistency and propose a Consistency-driven Calibration and Matching Framework (ConCM) that systematically mitigates the knowledge conflict inherent in FSCIL. Specifically, inspired by hippocampal associative memory, we design a memory-aware prototype calibration that extracts generalized semantic attributes from base classes and reintegrates them into novel classes to enhance the conceptual center consistency of features. Further, we propose dynamic structure matching, which adaptively aligns the calibrated features to a session-specific optimal manifold space, ensuring cross-session structure consistency. Theoretical analysis shows that our method satisfies both geometric optimality and maximum matching, thereby overcoming the need for class-number priors. On large-scale FSCIL benchmarks including mini-ImageNet and CUB200, ConCM achieves state-of-the-art performance, surpassing the current best method by 3.20% and 3.68% in harmonic accuracy over incremental sessions.
3.12Emergence of Text Readability in Vision Language Models¶
2025/06/24 07:35 GMT
We investigate how the ability to recognize textual content within images emerges during the training of Vision-Language Models (VLMs). Our analysis reveals a critical phenomenon: the ability to read textual information in a given image \textbf{(text readability)} emerges abruptly after substantial training iterations, in contrast to semantic content understanding which develops gradually from the early stages of training. This delayed emergence may reflect how contrastive learning tends to initially prioritize general semantic understanding, with text-specific symbolic processing developing later. Interestingly, the ability to match images with rendered text develops even slower, indicating a deeper need for semantic integration. These findings highlight the need for tailored training strategies to accelerate robust text comprehension in VLMs, laying the groundwork for future research on optimizing multimodal learning.
3.13Systematic Comparison of Projection Methods for Monocular 3D Human Pose Estimation on Fisheye Images¶
2025/06/24 16:05 GMT
Fisheye cameras offer robots the ability to capture human movements across a
wider field of view (FOV) than standard pinhole cameras, making them
particularly useful for applications in human-robot interaction and automotive
contexts. However, accurately detecting human poses in fisheye images is
challenging due to the curved distortions inherent to fisheye optics. While
various methods for undistorting fisheye images have been proposed, their
effectiveness and limitations for poses that cover a wide FOV have not been
systematically evaluated in the context of absolute human pose estimation from
monocular fisheye images. To address this gap, we evaluate the impact of
pinhole, equidistant and double sphere camera models, as well as cylindrical
projection methods, on 3D human pose estimation accuracy. We find that in
close-up scenarios, pinhole projection is inadequate, and the optimal
projection method varies with the FOV covered by the human pose. The usage of
advanced fisheye models like the double sphere model significantly enhances 3D
human pose estimation accuracy. We propose a heuristic for selecting the
appropriate projection model based on the detection bounding box to enhance
prediction quality. Additionally, we introduce and evaluate on our novel
dataset FISHnCHIPS, which features 3D human skeleton annotations in fisheye
images, including images from unconventional angles, such as extreme close-ups,
ground-mounted cameras, and wide-FOV poses, available at:
https://
3.15Improving Progressive Generation with Decomposable Flow Matching¶
2025/06/24 17:58 GMT
Generating high-dimensional visual modalities is a computationally intensive task. A common solution is progressive generation, where the outputs are synthesized in a coarse-to-fine spectral autoregressive manner. While diffusion models benefit from the coarse-to-fine nature of denoising, explicit multi-stage architectures are rarely adopted. These architectures have increased the complexity of the overall approach, introducing the need for a custom diffusion formulation, decomposition-dependent stage transitions, ad-hoc samplers, or a model cascade. Our contribution, Decomposable Flow Matching (DFM), is a simple and effective framework for the progressive generation of visual media. DFM applies Flow Matching independently at each level of a user-defined multi-scale representation (such as a Laplacian pyramid). As shown by our experiments, our approach improves visual quality for both images and videos, featuring superior results compared to prior multistage frameworks. On ImageNet-1k 512px, DFM achieves a 35.2% improvement in FDD scores over the base architecture and a 26.4% improvement over the best-performing baseline, under the same training compute. When applied to finetuning of large models, such as FLUX, DFM shows faster convergence speed to the training distribution. Crucially, all these advantages are achieved with a single model, architectural simplicity, and minimal modifications to existing training pipelines.
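A minimal sketch of the central idea, flow matching applied independently at each level of a Laplacian pyramid, is shown below; the pyramid construction and straight-line flow-matching target are standard, while the model's conditioning interface is an assumption.

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(x, levels=3):
    """Split an image batch (B, C, H, W) into band-pass levels plus a low-pass residual."""
    pyr, cur = [], x
    for _ in range(levels - 1):
        down = F.avg_pool2d(cur, 2)
        up = F.interpolate(down, size=cur.shape[-2:], mode="bilinear", align_corners=False)
        pyr.append(cur - up)   # band-pass detail at this scale
        cur = down
    pyr.append(cur)            # coarsest low-pass level
    return pyr

def dfm_style_loss(model, x0, levels=3):
    """Flow-matching loss summed over pyramid levels.

    model(xt, t, level) is an assumed conditioning interface that predicts the velocity field.
    """
    loss = 0.0
    for level, x in enumerate(laplacian_pyramid(x0, levels)):
        noise = torch.randn_like(x)
        t = torch.rand(x.shape[0], 1, 1, 1, device=x.device)
        xt = (1 - t) * noise + t * x      # linear path from noise to data
        target_v = x - noise              # constant velocity along that path
        loss = loss + F.mse_loss(model(xt, t.flatten(), level), target_v)
    return loss
```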
3.16T-Rex: Task-Adaptive Spatial Representation Extraction for Robotic Manipulation with Vision-Language Models¶
2025/06/24 10:36 GMT
Building a general robotic manipulation system capable of performing a wide variety of tasks in real-world settings is challenging. Vision-Language Models (VLMs) have demonstrated remarkable potential in robotic manipulation tasks, primarily due to the extensive world knowledge they gain from large-scale datasets. In this process, Spatial Representations (such as points representing object positions or vectors representing object orientations) act as a bridge between VLMs and the real-world scene, effectively grounding the reasoning abilities of VLMs and applying them to specific task scenarios. However, existing VLM-based robotic approaches often adopt a fixed spatial representation extraction scheme for various tasks, resulting in insufficient representational capability or excessive extraction time. In this work, we introduce T-Rex, a Task-Adaptive Framework for Spatial Representation Extraction, which dynamically selects the most appropriate spatial representation extraction scheme for each entity based on specific task requirements. Our key insight is that task complexity determines the types and granularity of spatial representations, and that stronger representational capabilities are typically associated with higher overall system operating costs. Through comprehensive experiments in real-world robotic environments, we show that our approach delivers significant advantages in spatial understanding, efficiency, and stability without additional training.
3.17Noise Consistency Training: A Native Approach for One-Step Generator in Learning Additional Controls¶
2025/06/24 15:58 GMT
The pursuit of efficient and controllable high-quality content generation
remains a central challenge in artificial intelligence-generated content
(AIGC). While one-step generators, enabled by diffusion distillation
techniques, offer excellent generation quality and computational efficiency,
adapting them to new control conditions--such as structural constraints,
semantic guidelines, or external inputs--poses a significant challenge.
Conventional approaches often necessitate computationally expensive
modifications to the base model and subsequent diffusion distillation. This
paper introduces Noise Consistency Training (NCT), a novel and lightweight
approach to directly integrate new control signals into pre-trained one-step
generators without requiring access to original training images or retraining
the base diffusion model. NCT operates by introducing an adapter module and
employs a noise consistency loss in the noise space of the generator. This loss
aligns the adapted model’s generation behavior across noises that are
conditionally dependent to varying degrees, implicitly guiding it to adhere to
the new control. Theoretically, this training objective can be understood as
minimizing the distributional distance between the adapted generator and the
conditional distribution induced by the new conditions. NCT is modular,
data-efficient, and easily deployable, relying only on the pre-trained one-step
generator and a control signal model. Extensive experiments demonstrate that
NCT achieves state-of-the-art controllable generation in a single forward pass,
surpassing existing multi-step and distillation-based methods in both
generation quality and computational efficiency. Code is available at
https://
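A hedged sketch of what a noise-consistency-style objective could look like: generate from two correlated noise draws under the same new condition and penalize disagreement. The generator/adapter interfaces and the correlation scheme are assumptions, not NCT's published formulation.

```python
import torch
import torch.nn.functional as F

def noise_consistency_loss(generator, adapter, cond, z, rho=0.9):
    """Illustrative consistency loss in the generator's noise space.

    generator: frozen one-step generator G(noise, control_features) (assumed interface)
    adapter:   trainable module injecting the new control signal `cond`
    z:         base Gaussian noise batch
    rho:       correlation between the two noise draws (assumed hyperparameter)
    """
    # Two conditionally dependent noises: z and a correlated perturbation of it.
    z2 = rho * z + (1.0 - rho ** 2) ** 0.5 * torch.randn_like(z)
    out1 = generator(z, adapter(cond))
    out2 = generator(z2, adapter(cond))
    # Encourage the adapted generator to behave consistently across the two noises.
    return F.mse_loss(out1, out2)
```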
3.18Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation¶
2025/06/24 16:39 GMT
We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. In Kling-Foley, we introduce multimodal diffusion transformers to model the interactions between video, audio, and text modalities, and combine them with a visual semantic representation module and an audio-visual synchronization module to enhance alignment capabilities. Specifically, these modules align video conditions with latent audio elements at the frame level, thereby improving semantic alignment and audio-visual synchronization. Together with text conditions, this integrated approach enables precise generation of video-matching sound effects. In addition, we propose a universal latent audio codec that can achieve high-quality modeling in various scenarios such as sound effects, speech, singing, and music. We employ a stereo rendering method that imbues synthesized audio with a spatial presence. At the same time, to make up for the incomplete sound types and annotations of existing open-source benchmarks, we also open-source an industrial-level benchmark, Kling-Audio-Eval. Our experiments show that Kling-Foley trained with the flow matching objective achieves new audio-visual SOTA performance among public models in terms of distribution matching, semantic alignment, temporal alignment and audio quality.
3.19Bind-Your-Avatar: Multi-Talking-Character Video Generation with Dynamic 3D-mask-based Embedding Router¶
2025/06/24 17:50 GMT
Recent years have witnessed remarkable advances in audio-driven talking head
generation. However, existing approaches predominantly focus on
single-character scenarios. While some methods can create separate conversation
videos between two individuals, the critical challenge of generating unified
conversation videos with multiple physically co-present characters sharing the
same spatial environment remains largely unaddressed. This setting presents two
key challenges: audio-to-character correspondence control and the lack of
suitable datasets featuring multi-character talking videos within the same
scene. To address these challenges, we introduce Bind-Your-Avatar, an
MM-DiT-based model specifically designed for multi-talking-character video
generation in the same scene. Specifically, we propose (1) A novel framework
incorporating a fine-grained Embedding Router that binds 'who' and 'speak
what' together to address the audio-to-character correspondence control. (2) Two
methods for implementing a 3D-mask embedding router that enables frame-wise,
fine-grained control of individual characters, with distinct loss functions
based on observed geometric priors and a mask refinement strategy to enhance
the accuracy and temporal smoothness of the predicted masks. (3) The first
dataset, to the best of our knowledge, specifically constructed for
multi-talking-character video generation, and accompanied by an open-source
data processing pipeline, and (4) A benchmark for the dual-talking-characters
video generation, with extensive experiments demonstrating superior performance
over multiple state-of-the-art methods.
3.20Sampling Matters in Explanations: Towards Trustworthy Attribution Analysis Building Block in Visual Models through Maximizing Explanation Certainty¶
2025/06/24 09:15 GMT
Image attribution analysis seeks to highlight the feature representations learned by visual models such that the highlighted feature maps can reflect the pixel-wise importance of inputs. Gradient integration is a building block in attribution analysis that integrates the gradients from multiple derived samples to highlight the semantic features relevant to inferences. Such a building block is often combined with other information from visual models, such as activation or attention maps, to form the ultimate explanations. Yet, our theoretical analysis demonstrates that the extent to which the sample distribution used in gradient integration aligns with the natural image distribution gives a lower bound on explanation certainty. Prior works add noise to images to form samples, and the resulting noise distributions can lead to low explanation certainty. Counter-intuitively, our experiment shows that extra information can saturate neural networks. To this end, building trustworthy attribution analysis requires settling the sample distribution misalignment problem. Instead of adding extra information to input images, we present a semi-optimal sampling approach that suppresses features from inputs. The sample distribution obtained by suppressing features is approximately identical to the distribution of natural images. Our extensive quantitative evaluation on the large-scale ImageNet dataset affirms that our approach is effective and yields more satisfactory explanations than state-of-the-art baselines across all experimental models.
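A minimal sketch of gradient integration with feature-suppression sampling (zeroing random pixel regions instead of adding noise); the masking scheme, drop probability, and aggregation are illustrative assumptions rather than the paper's exact procedure.

```python
import torch

def suppression_attribution(model, image, target_class, steps=16, drop_prob=0.3):
    """Average input gradients over samples formed by suppressing (zeroing) random pixels.

    model: differentiable classifier returning logits; image: tensor of shape (1, C, H, W).
    The step count, drop probability, and masking scheme are illustrative assumptions.
    """
    grads = torch.zeros_like(image)
    for _ in range(steps):
        # Suppress a random subset of pixels instead of adding noise to the input.
        keep = (torch.rand(1, 1, *image.shape[-2:], device=image.device) > drop_prob).float()
        sample = (image * keep).detach().requires_grad_(True)
        model(sample)[0, target_class].backward()
        grads += sample.grad.detach()
    # Channel-summed gradient magnitude as a pixel-wise importance map of shape (1, H, W).
    return (grads / steps).abs().sum(dim=1)
```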
3.21PEVLM: Parallel Encoding for Vision-Language Models¶
2025/06/24 14:14 GMT
Vision-Language Models (VLMs) have demonstrated strong performance in video-language tasks, yet their application to long video understanding remains constrained by the quadratic complexity of standard attention mechanisms. In this paper, we propose PEVLM, a parallel encoding strategy specifically designed to improve the prefill efficiency of VLMs without requiring model finetuning. PEVLM partitions the input into block-wise segments with a shared sink, preserves full-attention positional embeddings, and aligns attention weights to mimic full-attention distributions. This design substantially reduces attention computation while maintaining high accuracy. Extensive experiments on the LongVideoBench benchmark show that PEVLM achieves up to an 8.37% accuracy improvement over existing inference-efficient methods and delivers up to a 7.47x speedup in attention computation and a 40% reduction in end-to-end latency. Under strict latency constraints, PEVLM significantly outperforms baselines, raising accuracy from 23.26% to 61.03%. These results highlight PEVLM's effectiveness for low-latency, long-context video understanding, making it well-suited for real-world applications such as autonomous driving.
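The block-wise-segments-with-shared-sink idea can be illustrated with a simple attention-mask constructor; the block layout is an assumption and causal ordering is ignored for brevity, so this is a sketch rather than PEVLM's actual kernel.

```python
import torch

def blockwise_sink_mask(num_tokens, sink_len, block_len):
    """Boolean attention mask (True = attend): every token sees the shared sink prefix,
    and non-sink tokens additionally attend only within their own block."""
    mask = torch.zeros(num_tokens, num_tokens, dtype=torch.bool)
    mask[:, :sink_len] = True                      # shared sink visible to everyone
    for start in range(sink_len, num_tokens, block_len):
        end = min(start + block_len, num_tokens)
        mask[start:end, start:end] = True          # block-local attention
    return mask

# Example: 4 sink tokens, 3 blocks of 8 video tokens each.
print(blockwise_sink_mask(28, sink_len=4, block_len=8).sum())
```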
3.22Is Long-to-Short a Free Lunch? Investigating Inconsistency and Reasoning Efficiency in LRMs¶
2025/06/24 10:25 GMT
Large Reasoning Models (LRMs) have achieved remarkable performance on complex tasks by engaging in extended reasoning before producing final answers, yet this strength introduces the risk of overthinking, where excessive token generation occurs even for simple tasks. While recent work in efficient reasoning seeks to reduce reasoning length while preserving accuracy, it remains unclear whether such optimization is truly a free lunch. Drawing on the intuition that compressing reasoning may reduce the robustness of model responses and lead models to omit key reasoning steps, we investigate whether efficient reasoning strategies introduce behavioral inconsistencies. To systematically assess this, we introduce a benchmark designed to measure inconsistency in LRMs across three dimensions: inconsistency across task settings (ITS), inconsistency between training objectives and learned behavior (TR-LB), and inconsistency between internal reasoning and self-explanations (IR-SE). Applying it to a range of open-source LRMs, we find that while larger models generally exhibit greater consistency than smaller ones, they all display widespread “scheming” behaviors, including self-disagreement, post-hoc rationalization, and the withholding of reasoning cues. Crucially, our results demonstrate that efficient reasoning strategies such as No-Thinking and Simple Token-Budget consistently increase all three defined types of inconsistency. These findings suggest that although efficient reasoning enhances token-level efficiency, further investigation is imperative to ascertain whether it concurrently introduces the risk of models evading effective supervision.
3.23VideoPCDNet: Video Parsing and Prediction with Phase Correlation Networks¶
2025/06/24 13:39 GMT
Understanding and predicting video content is essential for planning and reasoning in dynamic environments. Despite advancements, unsupervised learning of object representations and dynamics remains challenging. We present VideoPCDNet, an unsupervised framework for object-centric video decomposition and prediction. Our model uses frequency-domain phase correlation techniques to recursively parse videos into object components, which are represented as transformed versions of learned object prototypes, enabling accurate and interpretable tracking. By explicitly modeling object motion through a combination of frequency domain operations and lightweight learned modules, VideoPCDNet enables accurate unsupervised object tracking and prediction of future video frames. In our experiments, we demonstrate that VideoPCDNet outperforms multiple object-centric baseline models for unsupervised tracking and prediction on several synthetic datasets, while learning interpretable object and motion representations.
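The classical phase-correlation step that underlies this kind of frequency-domain parsing can be sketched in a few lines; it estimates a global integer translation between two frames and is standard signal processing, not VideoPCDNet's full pipeline.

```python
import numpy as np

def phase_correlation_shift(a, b):
    """Estimate the integer (dy, dx) translation between frames `a` and `b`
    (2D arrays of equal shape); the sign convention depends on which frame is the reference."""
    Fa, Fb = np.fft.fft2(a), np.fft.fft2(b)
    cross = Fa * np.conj(Fb)
    cross /= np.abs(cross) + 1e-8                  # keep only phase information
    corr = np.fft.ifft2(cross).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap shifts larger than half the frame size to negative offsets.
    if dy > a.shape[0] // 2: dy -= a.shape[0]
    if dx > a.shape[1] // 2: dx -= a.shape[1]
    return dy, dx
```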
3.24Look to Locate: Vision-Based Multisensory Navigation with 3-D Digital Maps for GNSS-Challenged Environments¶
2025/06/24 17:44 GMT
In Global Navigation Satellite System (GNSS)-denied environments such as indoor parking structures or dense urban canyons, achieving accurate and robust vehicle positioning remains a significant challenge. This paper proposes a cost-effective, vision-based multi-sensor navigation system that integrates monocular depth estimation, semantic filtering, and visual map registration (VMR) with 3-D digital maps. Extensive testing in real-world indoor and outdoor driving scenarios demonstrates the effectiveness of the proposed system, achieving sub-meter accuracy of 92% indoors and more than 80% outdoors, with consistent horizontal positioning and heading average root mean-square errors of approximately 0.98 m and 1.25°, respectively. Compared to the baselines examined, the proposed solution significantly reduced drift and improved robustness under various conditions, achieving positioning accuracy improvements of approximately 88% on average. This work highlights the potential of cost-effective monocular vision systems combined with 3D maps for scalable, GNSS-independent navigation in land vehicles.
3.26CoCo4D: Comprehensive and Complex 4D Scene Generation¶
2025/06/24 17:05 GMT
Existing 4D synthesis methods primarily focus on object-level generation or
dynamic scene synthesis with limited novel views, restricting their ability to
generate multi-view consistent and immersive dynamic 4D scenes. To address
these constraints, we propose a framework (dubbed as CoCo4D) for generating
detailed dynamic 4D scenes from text prompts, with the option to include
images. Our method leverages the crucial observation that articulated motion
typically characterizes foreground objects, whereas background alterations are
less pronounced. Consequently, CoCo4D divides 4D scene synthesis into two
responsibilities: modeling the dynamic foreground and creating the evolving
background, both directed by a reference motion sequence. Given a text prompt
and an optional reference image, CoCo4D first generates an initial motion
sequence utilizing video diffusion models. This motion sequence then guides the
synthesis of both the dynamic foreground object and the background using a
novel progressive outpainting scheme. To ensure seamless integration of the
moving foreground object within the dynamic background, CoCo4D optimizes a
parametric trajectory for the foreground, resulting in realistic and coherent
blending. Extensive experiments show that CoCo4D achieves comparable or
superior performance in 4D scene generation compared to existing methods,
demonstrating its effectiveness and efficiency. More results are presented on
our website https://
3.27Vision Transformer-Based Time-Series Image Reconstruction for Cloud-Filling Applications¶
2025/06/24 13:00 GMT
Cloud cover in multispectral imagery (MSI) poses significant challenges for early season crop mapping, as it leads to missing or corrupted spectral information. Synthetic aperture radar (SAR) data, which is not affected by cloud interference, offers a complementary solution, but lacks sufficient spectral detail for precise crop mapping. To address this, we propose a novel framework, Time-series MSI Image Reconstruction using Vision Transformer (ViT), to reconstruct MSI data in cloud-covered regions by leveraging the temporal coherence of MSI and the complementary information from SAR via the attention mechanism. Comprehensive experiments, using rigorous reconstruction evaluation metrics, demonstrate that the Time-series ViT framework significantly outperforms baselines that use non-time-series MSI and SAR or time-series MSI without SAR, effectively enhancing MSI image reconstruction in cloud-covered regions.
3.28Radial Attention: Sparse Attention with Energy Decay for Long Video Generation¶
2025/06/24 17:59 GMT
Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as the spatial and temporal distance between tokens increases, akin to the physical decay of a signal or wave over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9x speedup over the original dense attention. With minimal tuning, it enables video generation up to 4x longer while reducing training costs by up to 4.4x compared to direct fine-tuning and accelerating inference by up to 3.7x compared to dense attention inference.
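A toy construction of a static mask whose spatial attention window shrinks with temporal distance is sketched below; the halving schedule and token layout are assumptions, not the paper's exact mask.

```python
import torch

def radial_style_mask(num_frames, tokens_per_frame, base_window, decay=2):
    """Boolean mask over (F*T, F*T) tokens: attend to spatially nearby tokens,
    with the window halving every `decay` frames of temporal distance (assumed schedule)."""
    n = num_frames * tokens_per_frame
    mask = torch.zeros(n, n, dtype=torch.bool)
    for qf in range(num_frames):
        for kf in range(num_frames):
            window = max(1, base_window >> (abs(qf - kf) // decay))  # shrink with temporal distance
            for qi in range(tokens_per_frame):
                q = qf * tokens_per_frame + qi
                lo, hi = max(0, qi - window), min(tokens_per_frame, qi + window + 1)
                mask[q, kf * tokens_per_frame + lo: kf * tokens_per_frame + hi] = True
    return mask

# Example: 6 frames of 16 tokens, spatial window of 8 at zero temporal distance.
print(radial_style_mask(6, 16, base_window=8).float().mean())  # fraction of attended pairs
```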
3.29SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning¶
2025/06/24 16:31 GMT
Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet the optimal integration of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through comprehensive analysis of token distributions, learning dynamics, and integration mechanisms from entropy-based perspectives, we reveal key differences between these paradigms: SFT induces coarse-grained global changes to LLM policy distributions, while RL performs fine-grained selective optimizations, with entropy serving as a critical indicator of training effectiveness. Building on these observations, we propose Supervised Reinforcement Fine-Tuning (SRFT), a single-stage method that unifies both fine-tuning paradigms through entropy-aware weighting mechanisms. Our approach simultaneously applies SFT and RL to directly optimize the LLM using demonstrations and self-exploration rollouts rather than through two-stage sequential methods. Extensive experiments show that SRFT achieves 59.1% average accuracy, outperforming zero-RL methods by 9.0% on five mathematical reasoning benchmarks and 10.9% on three out-of-distribution benchmarks.
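One way to picture a single-stage objective that mixes a supervised term on demonstrations with a policy-gradient term on rollouts under an entropy-derived weight is sketched below; the weighting form and its direction are assumptions, not SRFT's exact loss.

```python
import torch

def srft_style_loss(logp_demo, logp_rollout, advantages, policy_entropy, alpha=1.0):
    """Single-stage SFT + RL objective with an entropy-aware weight (form assumed).

    logp_demo:      summed token log-probs on demonstration sequences
    logp_rollout:   summed token log-probs on self-exploration rollouts
    advantages:     per-rollout scalar advantages (e.g., reward minus a baseline)
    policy_entropy: current mean policy entropy, used to balance the two terms
    """
    w_sft = torch.exp(-alpha * policy_entropy)                 # entropy-derived coefficient
    sft_loss = -logp_demo.mean()                               # supervised term on demonstrations
    rl_loss = -(advantages.detach() * logp_rollout).mean()     # REINFORCE-style term on rollouts
    return w_sft * sft_loss + (1.0 - w_sft) * rl_loss
```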
3.30Scaling Speculative Decoding with Lookahead Reasoning¶
2025/06/24 17:48 GMT
Reasoning models excel by generating long chain-of-thoughts, but decoding the
resulting thousands of tokens is slow. Token-level speculative decoding (SD)
helps, but its benefit is capped, because the chance that an entire
multi-token guess is correct falls exponentially with the guess length. This
means allocating more compute for longer token drafts faces an algorithmic
ceiling -- making the speedup modest and hardware-agnostic. We raise this
ceiling with Lookahead Reasoning, which exploits a second, step-level layer of
parallelism. Our key insight is that reasoning models generate step-by-step,
and each step needs only to be semantically correct, not exact token matching.
In Lookahead Reasoning, a lightweight draft model proposes several future
steps; the target model expands each proposal in one batched pass, and a
verifier keeps semantically correct steps while letting the target regenerate
any that fail. Token-level SD still operates within each reasoning step, so the
two layers of parallelism multiply. We show Lookahead Reasoning lifts the peak
speedup of SD both theoretically and empirically. Across GSM8K, AIME, and other
benchmarks, Lookahead Reasoning improves the speedup of SD from 1.4x to 2.1x
while preserving answer quality, and its speedup scales better with additional
GPU throughput. Our code is available at
https://
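A schematic of one step-level speculation round might look like the following; every helper method (propose_steps, generate_steps, semantically_equivalent) is a hypothetical placeholder used only to illustrate the draft-expand-verify loop.

```python
def lookahead_reasoning_round(prefix, draft_model, target_model, verifier, k=4):
    """One round of step-level speculation (all helper methods are hypothetical placeholders)."""
    proposals = draft_model.propose_steps(prefix, num_steps=k)        # cheap draft of k future steps
    contexts = [prefix + "".join(proposals[:i]) for i in range(k)]
    targets = target_model.generate_steps(contexts)                   # one batched pass, one step per context
    accepted = []
    for drafted, produced in zip(proposals, targets):
        if verifier.semantically_equivalent(drafted, produced):
            accepted.append(drafted)          # keep the draft step; it matched in meaning
        else:
            accepted.append(produced)         # fall back to the target model's own step
            break                             # later drafts were conditioned on a rejected step
    return prefix + "".join(accepted)
```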
3.31UniTac-NV: A Unified Tactile Representation For Non-Vision-Based Tactile Sensors¶
2025/06/24 15:11 GMT
Generalizable algorithms for tactile sensing remain underexplored, primarily due to the diversity of sensor modalities. Recently, many methods for cross-sensor transfer between optical (vision-based) tactile sensors have been investigated, yet little work focuses on non-optical tactile sensors. To address this gap, we propose an encoder-decoder architecture to unify tactile data across non-vision-based sensors. By leveraging sensor-specific encoders, the framework creates a latent space that is sensor-agnostic, enabling cross-sensor data transfer with low errors and direct use in downstream applications. We leverage this network to unify tactile data from two commercial tactile sensors: the Xela uSkin uSPa 46 and the Contactile PapillArray. Both were mounted on a UR5e robotic arm, performing force-controlled pressing sequences against distinct object shapes (circular, square, and hexagonal prisms) and two materials (rigid PLA and flexible TPU). Another, more complex unseen object was also included to investigate the model’s generalization capabilities. We show that alignment in latent space can be implicitly learned from joint autoencoder training with matching contacts collected via different sensors. We further demonstrate the practical utility of our approach through contact geometry estimation, where downstream models trained on one sensor’s latent representation can be directly applied to another without retraining.
3.32HOIverse: A Synthetic Scene Graph Dataset With Human Object Interactions¶
2025/06/24 14:00 GMT
When humans and robotic agents coexist in an environment, scene understanding becomes crucial for the agents to carry out various downstream tasks like navigation and planning. Hence, an agent must be capable of localizing and identifying actions performed by the human. Current research lacks reliable datasets for performing scene understanding within indoor environments where humans are also a part of the scene. Scene Graphs enable us to generate a structured representation of a scene or an image to perform visual scene understanding. To tackle this, we present HOIverse, a synthetic dataset at the intersection of scene graphs and human-object interaction, consisting of accurate and dense relationship ground truths between humans and surrounding objects along with corresponding RGB images, segmentation masks, depth images and human keypoints. We compute parametric relations between various pairs of objects and human-object pairs, resulting in accurate and unambiguous relation definitions. In addition, we benchmark our dataset on state-of-the-art scene graph generation models to predict parametric relations and human-object interactions. Through this dataset, we aim to accelerate research in the field of scene understanding involving people.
3.33Visual hallucination detection in large vision-language models via evidential conflict¶
2025/06/24 11:03 GMT
Despite the remarkable multimodal capabilities of Large Vision-Language
Models (LVLMs), discrepancies often occur between visual inputs and textual
outputs--a phenomenon we term visual hallucination. This critical reliability
gap poses substantial risks in safety-critical Artificial Intelligence (AI)
applications, necessitating a comprehensive evaluation benchmark and effective
detection methods. Firstly, we observe that existing visual-centric
hallucination benchmarks mainly assess LVLMs from a perception perspective,
overlooking hallucinations arising from advanced reasoning capabilities. We
develop the Perception-Reasoning Evaluation Hallucination (PRE-HAL) dataset,
which enables the systematic evaluation of both perception and reasoning
capabilities of LVLMs across multiple visual semantics, such as instances,
scenes, and relations. Comprehensive evaluation with this new benchmark exposed
more visual vulnerabilities, particularly in the more challenging task of
relation reasoning. To address this issue, we propose, to the best of our
knowledge, the first Dempster-Shafer theory (DST)-based visual hallucination
detection method for LVLMs through uncertainty estimation. This method aims to
efficiently capture the degree of conflict in high-level features at the model
inference phase. Specifically, our approach employs simple mass functions to
mitigate the computational complexity of evidence combination on power sets. We
conduct an extensive evaluation of state-of-the-art LVLMs, LLaVA-v1.5,
mPLUG-Owl2 and mPLUG-Owl3, with the new PRE-HAL benchmark. Experimental results
indicate that our method outperforms five baseline uncertainty metrics,
achieving average AUROC improvements of 4%, 10%, and 7% across three LVLMs. Our
code is available at https://
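The degree-of-conflict computation at the heart of Dempster-Shafer reasoning is easy to illustrate with simple mass functions (mass on one focal set plus the full frame), the kind of simplification the abstract alludes to; the example labels are invented.

```python
def conflict(m1, m2):
    """Dempster-Shafer conflict K between two mass functions.

    Each mass function is a dict mapping focal sets (frozensets of hypotheses) to masses summing to 1.
    K sums the products of masses whose focal sets have an empty intersection.
    """
    k = 0.0
    for s1, w1 in m1.items():
        for s2, w2 in m2.items():
            if not (s1 & s2):
                k += w1 * w2
    return k

# Simple mass functions: all mass on one singleton plus the ignorance set (the full frame).
frame = frozenset({"cat", "dog"})
m_visual = {frozenset({"cat"}): 0.7, frame: 0.3}
m_text = {frozenset({"dog"}): 0.6, frame: 0.4}
print(conflict(m_visual, m_text))   # 0.42: high conflict between the two sources
```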
3.34ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing¶
2025/06/24 17:59 GMT
This paper presents ScaleCap, an inference-time scalable image captioning
strategy that generates comprehensive and detailed image captions. The key
challenges of high-quality image captioning lie in the inherent biases of
LVLMs: multimodal bias resulting in imbalanced descriptive granularity,
offering detailed accounts of some elements while merely skimming over others;
linguistic bias leading to hallucinated descriptions of non-existent objects.
To address these issues, we propose a scalable debiased captioning strategy,
which continuously enriches and calibrates the caption with increased inference
budget. Specifically, we propose two novel components: heuristic question
answering and contrastive sentence rating. The former generates
content-specific questions based on the image and answers them to progressively
inject relevant information into the caption. The latter employs sentence-level
offline contrastive decoding to effectively identify and eliminate
hallucinations caused by linguistic biases. With increased inference cost, more
heuristic questions are raised by ScaleCap to progressively capture additional
visual details, generating captions that are more accurate, balanced, and
informative. Extensive modality alignment experiments demonstrate the
effectiveness of ScaleCap. Annotating 450K images with ScaleCap and using them
for LVLM pretraining leads to consistent performance gains across 11 widely
used benchmarks. Furthermore, ScaleCap showcases superb richness and fidelity
of generated captions with two additional tasks: replacing images with captions
in the VQA task, and reconstructing images from captions to assess semantic
coverage. Code is available at https://
3.36Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager¶
2025/06/24 14:15 GMT
In this work, we propose a novel framework that integrates large language models (LLMs) with an RL-based dialogue manager for open-ended dialogue with a specific goal. By leveraging hierarchical reinforcement learning to model the structured phases of dialogue and employing meta-learning to adapt across diverse user profiles, our approach enhances adaptability and efficiency, enabling the system to learn from limited data, transition fluidly between dialogue phases, and personalize responses to heterogeneous patient needs. We apply our framework to Motivational Interviews, aiming to foster behavior change, and demonstrate that the proposed dialogue manager outperforms a state-of-the-art LLM baseline in terms of reward, showing a potential benefit of conditioning LLMs to create open-ended dialogue systems with specific goals.
3.37Commonsense Generation and Evaluation for Dialogue Systems using Large Language Models¶
2025/06/24 10:18 GMT
This paper provides preliminary results on exploring the task of performing turn-level data augmentation for dialogue systems based on different types of commonsense relationships, and on the automatic evaluation of the generated synthetic turns. The proposed methodology takes advantage of the extended knowledge and zero-shot capabilities of pretrained Large Language Models (LLMs) to follow instructions, understand contextual information, and apply their commonsense reasoning capabilities. The approach draws inspiration from methodologies like Chain-of-Thought (CoT), applied more explicitly to the task of prompt-based generation for dialogue-based data augmentation conditioned on commonsense attributes, and to the automatic evaluation of the generated dialogues. To assess the effectiveness of the proposed approach, we first extracted 200 randomly selected partial dialogues from 5 different well-known dialogue datasets and generated alternative responses conditioned on different event commonsense attributes. This novel dataset allows us to measure the proficiency of LLMs in generating contextually relevant commonsense knowledge, covering up to 12 different specific ATOMIC [10] database relations. Secondly, we propose an evaluation framework to automatically assess the quality of the generated dataset, inspired by the ACCENT [26] metric, which offers a nuanced approach to assessing event commonsense. However, our method does not follow ACCENT’s complex event-relation tuple extraction process. Instead, we propose an instruction-based prompt for each commonsense attribute and use state-of-the-art LLMs to automatically detect the original attributes used when creating each augmented turn in the previous step. Preliminary results suggest that our approach effectively harnesses LLM capabilities for commonsense reasoning and evaluation in dialogue systems.
3.38Zero-Shot Parameter Learning of Robot Dynamics Using Bayesian Statistics and Prior Knowledge¶
2025/06/24 06:32 GMT
Inertial parameter identification of industrial robots is an established process, but standard methods using Least Squares or Machine Learning do not consider prior information about the robot and require extensive measurements. Inspired by Bayesian statistics, this paper presents an identification method with improved generalization that incorporates prior knowledge and is able to learn with only a few or without additional measurements (Zero-Shot Learning). Furthermore, our method is able to correctly learn not only the inertial but also the mechanical and base parameters of the MABI Max 100 robot while ensuring physical feasibility and specifying the confidence intervals of the results. We also provide different types of priors for serial robots with 6 degrees of freedom, where datasheets or CAD models are not available.
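The Bayesian update behind this kind of identification is standard Bayesian linear regression: a Gaussian prior on the parameters (e.g., from datasheets or CAD data) combined with the linear-in-parameters dynamics regressor. The sketch below uses generic symbols, not the paper's notation.

```python
import numpy as np

def bayesian_parameter_update(Phi, tau, theta_prior, Sigma_prior, noise_var=1e-2):
    """Posterior over dynamics parameters theta given regressor Phi and measured torques tau.

    Model: tau = Phi @ theta + noise,  with prior  theta ~ N(theta_prior, Sigma_prior).
    With no measurements (Phi has zero rows) the posterior equals the prior: a zero-shot estimate.
    """
    if Phi.shape[0] == 0:
        return theta_prior, Sigma_prior
    Sigma_prior_inv = np.linalg.inv(Sigma_prior)
    Sigma_post = np.linalg.inv(Sigma_prior_inv + Phi.T @ Phi / noise_var)
    theta_post = Sigma_post @ (Sigma_prior_inv @ theta_prior + Phi.T @ tau / noise_var)
    return theta_post, Sigma_post
```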
3.39Virtual Memory for 3D Gaussian Splatting¶
2025/06/24 08:31 GTM
3D Gaussian Splatting represents a breakthrough in the field of novel view synthesis. It establishes Gaussians as core rendering primitives for highly accurate real-world environment reconstruction. Recent advances have drastically increased the size of scenes that can be created. In this work, we present a method for rendering large and complex 3D Gaussian Splatting scenes using virtual memory. By leveraging well-established virtual memory and virtual texturing techniques, our approach efficiently identifies visible Gaussians and dynamically streams them to the GPU just in time for real-time rendering. Selecting only the necessary Gaussians for both storage and rendering results in reduced memory usage and effectively accelerates rendering, especially for highly complex scenes. Furthermore, we demonstrate how level of detail can be integrated into our proposed method to further enhance rendering speed for large-scale scenes. With an optimized implementation, we highlight key practical considerations and thoroughly evaluate the proposed technique and its impact on desktop and mobile devices.
3.40Orthogonal Finetuning Made Scalable¶
2025/06/24 17:59 GTM
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. To overcome this, we propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. We further introduce the Cayley-Neumann parameterization, an efficient orthogonal parameterization that approximates the matrix inversion in Cayley transform via a truncated Neumann series. These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance. In addition, we extend OFTv2 to support finetuning quantized foundation models and show that it outperforms the popular QLoRA in training stability, efficiency, and memory usage.
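For intuition about the matrix-free (input-centric) computation and the Cayley-Neumann parameterization described above, here is a minimal sketch of the general mechanism. It is not the authors' implementation; the scale of the parameter, the number of Neumann terms, and the vector-only interface are illustrative assumptions.

```python
import torch

def cayley_neumann_apply(S: torch.Tensor, x: torch.Tensor, n_terms: int = 5) -> torch.Tensor:
    """Apply R = (I - S)(I + S)^{-1} to a vector x using only matrix-vector
    products. For skew-symmetric S, R is orthogonal (Cayley transform); the
    inverse is approximated by the truncated Neumann series sum_k (-S)^k x,
    which converges when the spectral norm of S is below 1."""
    y = x.clone()                   # k = 0 term of the Neumann series
    term = x.clone()
    for _ in range(1, n_terms):
        term = -torch.mv(S, term)   # next term: (-S)^k x
        y = y + term
    return y - torch.mv(S, y)       # multiply the partial sum by (I - S)

d = 16
A = 0.01 * torch.randn(d, d)
S = A - A.T                          # skew-symmetric parameterization
x = torch.randn(d)
rx = cayley_neumann_apply(S, x)
print(abs(rx.norm().item() - x.norm().item()) < 1e-2)  # orthogonal maps preserve norm
```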
3.42In-Context Occam’s Razor: How Transformers Prefer Simpler Hypotheses on the Fly¶
2025/06/24 06:33 GTM
In-context learning (ICL) enables transformers to adapt to new tasks through contextual examples without parameter updates. While existing research has typically studied ICL in fixed-complexity environments, practical language models encounter tasks spanning diverse complexity levels. This paper investigates how transformers navigate hierarchical task structures where higher-complexity categories can perfectly represent any pattern generated by simpler ones. We design well-controlled testbeds based on Markov chains and linear regression that reveal transformers not only identify the appropriate complexity level for each task but also accurately infer the corresponding parameters--even when the in-context examples are compatible with multiple complexity hypotheses. Notably, when presented with data generated by simpler processes, transformers consistently favor the least complex sufficient explanation. We theoretically explain this behavior through a Bayesian framework, demonstrating that transformers effectively implement an in-context Bayesian Occam's razor by balancing model fit against complexity penalties. We further ablate the roles of model size, training mixture distribution, inference context length, and architecture. Finally, we validate this Occam's razor-like inductive bias on a pretrained GPT-4 model with Boolean-function tasks as a case study, suggesting it may be inherent to transformers trained on diverse task distributions.
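The Bayesian Occam's razor invoked above can be stated with the standard marginal-likelihood identity (a textbook relation, not an equation taken from the paper): a hypothesis class flexible enough to explain many datasets must spread its marginal likelihood thinly, so when a simple and a complex class both fit the in-context examples, the simpler one receives the higher posterior.

```latex
P(\mathcal{M}_i \mid D) \propto P(D \mid \mathcal{M}_i)\, P(\mathcal{M}_i),
\qquad
P(D \mid \mathcal{M}_i) = \int P(D \mid \theta, \mathcal{M}_i)\, P(\theta \mid \mathcal{M}_i)\, \mathrm{d}\theta .
```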
3.43Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models¶
2025/06/24 15:03 GTM
Extreme activation outliers in Large Language Models (LLMs) critically
degrade quantization performance, hindering efficient on-device deployment.
While channel-wise operations and adaptive gradient scaling are recognized
causes, practical mitigation remains challenging. We introduce Outlier-Safe
Pre-Training (OSP), a practical guideline that proactively prevents outlier
formation rather than relying on post-hoc mitigation. OSP combines three key
innovations: (1) the Muon optimizer, eliminating privileged bases while
maintaining training efficiency; (2) Single-Scale RMSNorm, preventing
channel-wise amplification; and (3) a learnable embedding projection,
redistributing activation magnitudes originating from embedding matrices. We
validate OSP by training a 1.4B-parameter model on 1 trillion tokens, which is
the first production-scale LLM trained without such outliers. Under aggressive
4-bit quantization, our OSP model achieves a 35.7 average score across 10
benchmarks (compared to 26.5 for an Adam-trained model), with only a 2%
training overhead. Remarkably, OSP models exhibit near-zero excess kurtosis
(0.04) compared to extreme values (1818.56) in standard models, fundamentally
altering LLM quantization behavior. Our work demonstrates that outliers are not
inherent to LLMs but are consequences of training strategies, paving the way
for more efficient LLM deployment. The source code and pretrained checkpoints
are available at https://
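The excess-kurtosis figures quoted above follow the standard definition; below is a small sketch of how one might measure it on an activation tensor, using toy data rather than the paper's evaluation code.

```python
import torch

def excess_kurtosis(x: torch.Tensor) -> float:
    """Excess kurtosis of a flattened activation tensor: near 0 for
    Gaussian-like activations, large and positive when extreme outliers
    dominate (the regime that breaks low-bit quantization)."""
    x = x.flatten().float()
    x = x - x.mean()
    var = x.pow(2).mean()
    return (x.pow(4).mean() / var.pow(2) - 3.0).item()

acts = torch.randn(4096)        # well-behaved activations
spiky = acts.clone()
spiky[0] = 50.0                 # a single extreme outlier channel
print(round(excess_kurtosis(acts), 2), round(excess_kurtosis(spiky), 2))
```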
3.44Unified Vision-Language-Action Model¶
2025/06/24 17:59 GTM
Vision-language-action models (VLAs) have garnered significant attention for their potential in advancing robotic manipulation. However, previous approaches predominantly rely on the general comprehension capabilities of vision-language models (VLMs) to generate action signals, often overlooking the rich temporal and causal structure embedded in visual observations. In this paper, we present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. This formulation enables flexible multimodal task learning, particularly from large-scale video data. By incorporating world modeling during post-training, UniVLA captures causal dynamics from videos, facilitating effective transfer to downstream policy learning--especially for long-horizon tasks. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and Simplenv-Bridge, significantly surpassing previous methods. For example, UniVLA achieves a 95.5% average success rate on the LIBERO benchmark, surpassing pi0-FAST's 85.5%. We further demonstrate its broad applicability on real-world ALOHA manipulation and autonomous driving.
3.46Continual Retinal Vision-Language Pre-training upon Incremental Imaging Modalities¶
2025/06/24 05:18 GTM
Traditional fundus image analysis models focus on single-modal tasks,
ignoring fundus modality complementarity, which limits their versatility.
Recently, retinal foundation models have emerged, but most still remain
modality-specific. Integrating multiple fundus imaging modalities into a single
foundation model is valuable. However, in dynamic environments, data from
different modalities often arrive incrementally, necessitating continual
pre-training. To address this, we propose RetCoP, the first continual
vision-language pre-training framework in the fundus domain, which
incrementally integrates image and text features from different imaging
modalities into a single unified foundation model. To mitigate catastrophic
forgetting in continual pre-training, we introduce a rehearsal strategy
utilizing representative image-text pairs and an off-diagonal information
distillation approach. The former allows the model to revisit knowledge from
previous stages, while the latter explicitly preserves the alignment between
image and text representations. Experiments show that RetCoP outperforms all
the compared methods, achieving the best generalization and lowest forgetting
rate. The code can be found at https://
3.47ManiGaussian++: General Robotic Bimanual Manipulation with Hierarchical Gaussian World Model¶
2025/06/24 17:59 GTM
Multi-task robotic bimanual manipulation is becoming increasingly popular as
it enables sophisticated tasks that require diverse dual-arm collaboration
patterns. Compared to unimanual manipulation, bimanual tasks pose challenges to
understanding the multi-body spatiotemporal dynamics. An existing method,
ManiGaussian, pioneers encoding the spatiotemporal dynamics into the visual
representation via a Gaussian world model for single-arm settings, but it
ignores the interaction of multiple embodiments in dual-arm systems, resulting
in a significant performance drop. In this paper, we propose ManiGaussian++, an
extension of the ManiGaussian framework that improves multi-task bimanual manipulation by
digesting multi-body scene dynamics through a hierarchical Gaussian world
model. To be specific, we first generate task-oriented Gaussian Splatting from
intermediate visual features, which aims to differentiate acting and
stabilizing arms for multi-body spatiotemporal dynamics modeling. We then build
a hierarchical Gaussian world model with the leader-follower architecture,
where the multi-body spatiotemporal dynamics is mined for intermediate visual
representation via future scene prediction. The leader predicts Gaussian
Splatting deformation caused by motions of the stabilizing arm, through which
the follower generates the physical consequences resulting from the movement of
the acting arm. As a result, our method significantly outperforms the current
state-of-the-art bimanual manipulation techniques by an improvement of 20.2% in
10 simulated tasks, and achieves 60% success rate on average in 9 challenging
real-world tasks. Our code is available at
https://
3.48Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System¶
2025/06/24 09:00 GTM
Vision-and-Language Navigation (VLN) in large-scale urban environments
requires embodied agents to ground linguistic instructions in complex scenes
and recall relevant experiences over extended time horizons. Prior modular
pipelines offer interpretability but lack unified memory, while end-to-end
(M)LLM agents excel at fusing vision and language yet remain constrained by
fixed context windows and implicit spatial reasoning. We introduce
Mem4Nav, a hierarchical spatial-cognition long-short memory system
that can augment any VLN backbone. Mem4Nav fuses a sparse octree for
fine-grained voxel indexing with a semantic topology graph for high-level
landmark connectivity, storing both in trainable memory tokens embedded via a
reversible Transformer. Long-term memory (LTM) compresses and retains
historical observations at both octree and graph nodes, while short-term memory
(STM) caches recent multimodal entries in relative coordinates for real-time
obstacle avoidance and local planning. At each step, STM retrieval sharply
prunes dynamic context, and, when deeper history is needed, LTM tokens are
decoded losslessly to reconstruct past embeddings. Evaluated on Touchdown and
Map2Seq across three backbones (modular, state-of-the-art VLN with prompt-based
LLM, and state-of-the-art VLN with strided-attention MLLM), Mem4Nav yields 7-13
pp gains in Task Completion, sufficient SPD reduction, and >10 pp nDTW
improvement. Ablations confirm the indispensability of both the hierarchical
map and dual memory modules. Our codes are open-sourced via
https://
3.50Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects¶
2025/06/24 12:45 GTM
Robotic scene understanding increasingly relies on vision-language models (VLMs) to generate natural language descriptions of the environment. In this work, we present a comparative study of captioning strategies for tabletop scenes captured by a robotic arm equipped with an RGB camera. The robot collects images of objects from multiple viewpoints, and we evaluate several models that generate scene descriptions. We compare the performance of various captioning models, like BLIP and VLMs. Our experiments examine the trade-offs between single-view and multi-view captioning, and the difference between recognising real-world and 3D-printed objects. We quantitatively evaluate object identification accuracy, completeness, and naturalness of the generated captions. Results show that VLMs can be used in robotic settings where common objects need to be recognised, but fail to generalise to novel representations. Our findings provide practical insights into deploying foundation models for embodied agents in real-world settings.
3.53Spotting Out-of-Character Behavior: Atomic-Level Evaluation of Persona Fidelity in Open-Ended Generation¶
2025/06/24 06:33 GTM
Ensuring persona fidelity in large language models (LLMs) is essential for maintaining coherent and engaging human-AI interactions. However, LLMs often exhibit Out-of-Character (OOC) behavior, where generated responses deviate from an assigned persona, leading to inconsistencies that affect model reliability. Existing evaluation methods typically assign single scores to entire responses, struggling to capture subtle persona misalignment, particularly in long-form text generation. To address this limitation, we propose an atomic-level evaluation framework that quantifies persona fidelity at a finer granularity. Our three key metrics measure the degree of persona alignment and consistency within and across generations. Our approach enables a more precise and realistic assessment of persona fidelity by identifying subtle deviations that real users would encounter. Through our experiments, we demonstrate that our framework effectively detects persona inconsistencies that prior methods overlook. By analyzing persona fidelity across diverse tasks and personality types, we reveal how task structure and persona desirability influence model adaptability, highlighting challenges in maintaining consistent persona expression.
3.54ReLink: Computational Circular Design of Planar Linkage Mechanisms Using Available Standard Parts¶
2025/06/24 14:22 GTM
The Circular Economy framework emphasizes sustainability by reducing resource consumption and waste through the reuse of components and materials. This paper presents ReLink, a computational framework for the circular design of planar linkage mechanisms using available standard parts. Unlike most mechanism design methods, which assume the ability to create custom parts and infinite part availability, ReLink prioritizes the reuse of discrete, standardized components, thus minimizing the need for new parts. The framework consists of two main components: design generation, where a generative design algorithm generates mechanisms from an inventory of available parts, and inverse design, which uses optimization methods to identify designs that match a user-defined trajectory curve. The paper also examines the trade-offs between kinematic performance and CO2 footprint when incorporating new parts. Challenges such as the combinatorial nature of the design problem and the enforcement of valid solutions are addressed. By combining sustainability principles with kinematic synthesis, ReLink lays the groundwork for further research into computational circular design to support the development of systems that integrate reused components into mechanical products.
3.55RCStat: A Statistical Framework for using Relative Contextualization in Transformers¶
2025/06/24 11:55 GTM
Prior work on input-token importance in auto-regressive transformers has relied on Softmax-normalized attention weights, which obscure the richer structure of pre-Softmax query-key logits. We introduce RCStat, a statistical framework that harnesses raw attention logits via Relative Contextualization (RC), a random variable measuring contextual alignment between token segments, and derive an efficient upper bound for RC. We demonstrate two applications: (i) Key-Value compression, where RC-based thresholds drive adaptive key-value eviction for substantial cache reduction with minimal quality loss; and (ii) Attribution, where RC yields higher-fidelity token-, sentence-, and chunk-level explanations than post-Softmax methods. Across question answering, summarization, and attribution benchmarks, RCStat achieves significant empirical gains, delivering state-of-the-art compression and attribution performance without any model retraining.
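As a rough illustration of using raw (pre-Softmax) query-key logits to drive key-value eviction, the sketch below scores each cached token by its largest logit against recent queries and keeps the top fraction. This is a simplistic stand-in for the paper's relative contextualization statistic and its bound, not the method itself.

```python
import torch

def rc_scores(q_recent: torch.Tensor, k_cache: torch.Tensor) -> torch.Tensor:
    """Score each cached key by its largest raw query-key logit against the
    recent queries (an illustrative proxy, not the paper's RC definition)."""
    logits = q_recent @ k_cache.T / (q_recent.size(-1) ** 0.5)
    return logits.max(dim=0).values            # one score per cached token

def evict_kv(k_cache, v_cache, q_recent, keep_ratio=0.5):
    """Keep only the highest-scoring fraction of the cache, preserving order."""
    scores = rc_scores(q_recent, k_cache)
    n_keep = max(1, int(keep_ratio * k_cache.size(0)))
    idx = scores.topk(n_keep).indices.sort().values
    return k_cache[idx], v_cache[idx]

k, v, q = torch.randn(128, 64), torch.randn(128, 64), torch.randn(8, 64)
k_small, v_small = evict_kv(k, v, q)
print(k_small.shape)    # torch.Size([64, 64])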
3.56Dialogic Pedagogy for Large Language Models: Aligning Conversational AI with Proven Theories of Learning¶
2025/06/24 10:19 GTM
Large Language Models (LLMs) are rapidly transforming education by enabling rich conversational learning experiences. This article provides a comprehensive review of how LLM-based conversational agents are being used in higher education, with extensions to secondary and lifelong learning contexts. We synthesize existing literature on LLMs in education and theories of conversational and dialogic pedagogy - including Vygotsky's sociocultural learning (scaffolding and the Zone of Proximal Development), the Socratic method, and Laurillard's conversational framework - and examine how prompting strategies and retrieval-augmented generation (RAG) can align LLM behaviors with these pedagogical theories, and how they can support personalized, adaptive learning. We map educational theories to LLM capabilities, highlighting where LLM-driven dialogue supports established learning principles and where it challenges or falls short of traditional pedagogical assumptions. Notable gaps in applying prior theories to LLMs are identified, such as the models' tendency to provide direct answers instead of fostering co-construction of knowledge, and the need to account for the constant availability and broad but non-human expertise of LLM tutors. In response, we propose practical strategies to better align LLM interactions with sound pedagogy - for example, designing prompts that encourage Socratic questioning, scaffolded guidance, and student reflection, as well as integrating retrieval mechanisms to ensure accuracy and contextual relevance. Our aim is to bridge the gap between educational theory and the emerging practice of AI-driven conversational learning, offering insights and tools for making LLM-based dialogues more educationally productive and theory-aligned.
3.57Implementing blind navigation through multi-modal sensing and gait guidance¶
2025/06/24 13:03 GTM
By 2023, the global population of individuals with impaired vision had surpassed 220 million. People with impaired vision find it difficult to find a path or avoid obstacles and must rely on auxiliary tools for help. Although traditional aids such as guide canes and guide dogs exist, they still have shortcomings. In this paper, we present a wearable blind-guiding device that performs navigation guidance through our proposed Gait-based Guiding System. Our device innovatively integrates gait phase analysis for walking guidance, and for environmental perception it uses multimodal sensing to acquire diverse environmental information. We conducted both indoor and outdoor experiments and compared the device with a standard guide cane. The results show the superior performance of our device in blind guidance.
3.58JCAPT: A Joint Modeling Approach for CAPT¶
2025/06/24 05:12 GTM
Effective pronunciation feedback is critical in second language (L2) learning, for which computer-assisted pronunciation training (CAPT) systems often encompass two key tasks: automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD). Recent work has shown that joint modeling of these two tasks can yield mutual benefits. Our unified framework leverages Mamba, a selective state space model (SSM), while integrating phonological features and think token strategies to jointly enhance interpretability and fine-grained temporal reasoning in APA and MDD. To our knowledge, this is the first study to combine phonological attribution, SSM-based modeling, and prompting in CAPT. A series of experiments conducted on the speechocean762 benchmark demonstrate that our model consistently outperforms prior methods, particularly on the MDD task.
3.59KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality¶
2025/06/24 17:17 GTM
Large Language Models (LLMs), particularly slow-thinking models, often
exhibit severe hallucination, outputting incorrect content due to an inability
to accurately recognize knowledge boundaries during reasoning. While
Reinforcement Learning (RL) can enhance complex reasoning abilities, its
outcome-oriented reward mechanism often lacks factual supervision over the
thinking process, further exacerbating the hallucination problem. To address
the high hallucination in slow-thinking models, we propose Knowledge-enhanced
RL, KnowRL. KnowRL guides models to perform fact-based slow thinking by
integrating a factuality reward, based on knowledge verification, into the RL
training process, helping them recognize their knowledge boundaries. This targeted factual input during
RL training enables the model to learn and internalize fact-based reasoning
strategies. By directly rewarding adherence to facts within the reasoning
steps, KnowRL fosters a more reliable thinking process. Experimental results on
three hallucination evaluation datasets and two reasoning evaluation datasets
demonstrate that KnowRL effectively mitigates hallucinations in slow-thinking
models while maintaining their original strong reasoning capabilities. Our code
is available at https://
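The core idea above is adding a factuality term to an outcome-oriented RL reward. The snippet below is a hypothetical reward-shaping sketch in that spirit; the weighting, the binary outcome term, and the claim-verification counts are illustrative assumptions, not the paper's reward.

```python
def knowrl_style_reward(answer_correct: bool,
                        verified_claims: int,
                        total_claims: int,
                        w_fact: float = 0.5) -> float:
    """Hypothetical reward: an outcome term for the final answer plus a
    factuality term for the fraction of reasoning-step claims verified
    against a knowledge source."""
    outcome = 1.0 if answer_correct else 0.0
    factuality = verified_claims / max(total_claims, 1)
    return outcome + w_fact * factuality

print(knowrl_style_reward(True, verified_claims=3, total_claims=4))   # 1.375
```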
3.61Adaptive Domain Modeling with Language Models: A Multi-Agent Approach to Task Planning¶
2025/06/24 13:02 GTM
We introduce TAPAS (Task-based Adaptation and Planning using AgentS), a multi-agent framework that integrates Large Language Models (LLMs) with symbolic planning to solve complex tasks without the need for manually defined environment models. TAPAS employs specialized LLM-based agents that collaboratively generate and adapt domain models, initial states, and goal specifications as needed using structured tool-calling mechanisms. Through this tool-based interaction, downstream agents can request modifications from upstream agents, enabling adaptation to novel attributes and constraints without manual domain redefinition. A ReAct (Reason+Act)-style execution agent, coupled with natural language plan translation, bridges the gap between dynamically generated plans and real-world robot capabilities. TAPAS demonstrates strong performance in benchmark planning domains and in the VirtualHome simulated real-world environment.
3.62Video Compression for Spatiotemporal Earth System Data¶
2025/06/24 14:20 GTM
Large-scale Earth system datasets, from high-resolution remote sensing
imagery to spatiotemporal climate model outputs, exhibit characteristics
analogous to those of standard videos. Their inherent spatial, temporal, and
spectral redundancies can thus be readily exploited by established video
compression techniques. Here, we present xarrayvideo, a Python library for
compressing multichannel spatiotemporal datasets by encoding them as videos.
Our approach achieves compression ratios of up to 250x while maintaining high
fidelity by leveraging standard, well-optimized video codecs through ffmpeg. We
demonstrate the library’s effectiveness on four real-world multichannel
spatiotemporal datasets: DynamicEarthNet (very high resolution Planet images),
DeepExtremeCubes (high resolution Sentinel-2 images), ERA5 (weather reanalysis
data), and the SimpleS2 dataset (high resolution multichannel Sentinel-2
images), achieving Peak Signal-to-Noise Ratios (PSNRs) of 55.86, 40.60, 46.58,
and 43.23 dB at 0.1 bits per pixel per band (bpppb) and 65.91, 54.28, 62.90,
and 55.04 dB at 1 bpppb. We are redistributing two of these datasets,
DeepExtremeCubes (2.3 Tb) and DynamicEarthNet (525 Gb), in the
machine-learning-ready and cloud-ready TACO format through HuggingFace at
significantly reduced sizes (270 Gb and 8.5 Gb, respectively) without
compromising quality (PSNR 55.77-56.65 and 60.15). No performance loss is
observed when the compressed versions of these datasets are used in their
respective deep learning-based downstream tasks (next step reflectance
prediction and landcover segmentation). In conclusion, xarrayvideo presents an
efficient solution for handling the rapidly growing size of Earth observation
datasets, making advanced compression techniques accessible and practical to
the Earth science community. The library is available for use at
https://
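The PSNR figures quoted above follow the standard definition; a minimal reference implementation on toy data (not the library's API) is shown below.

```python
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray, data_range: float) -> float:
    """Peak Signal-to-Noise Ratio in dB, the fidelity metric quoted above."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(data_range ** 2 / mse)

ref = np.random.rand(64, 64, 4)                          # toy multichannel tile
decoded = ref + np.random.normal(0.0, 0.01, ref.shape)   # stand-in for codec error
print(round(psnr(ref, decoded, data_range=1.0), 2))      # roughly 40 dB
```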
3.63Progressive Modality Cooperation for Multi-Modality Domain Adaptation¶
2025/06/24 05:13 GTM
In this work, we propose a new generic multi-modality domain adaptation framework called Progressive Modality Cooperation (PMC) to transfer the knowledge learned from the source domain to the target domain by exploiting multiple modality clues (e.g., RGB and depth) under the multi-modality domain adaptation (MMDA) and the more general multi-modality domain adaptation using privileged information (MMDA-PI) settings. Under the MMDA setting, the samples in both domains have all the modalities. In two newly proposed modules of our PMC, the multiple modalities cooperate to select reliable pseudo-labeled target samples, capturing modality-specific information and modality-integrated information, respectively. Under the MMDA-PI setting, some modalities are missing in the target domain. Hence, to better exploit the multi-modality data in the source domain, we further propose the PMC with privileged information (PMC-PI) method by proposing a new multi-modality data generation (MMG) network. MMG generates the missing modalities in the target domain based on the source domain data by considering both domain distribution mismatch and semantics preservation, which are respectively achieved by using adversarial learning and conditioning on weighted pseudo semantics. Extensive experiments on three image datasets and eight video datasets for various multi-modality cross-domain visual recognition tasks under both MMDA and MMDA-PI settings clearly demonstrate the effectiveness of our proposed PMC framework.
3.64Uncovering Conceptual Blindspots in Generative Image Models Using Sparse Autoencoders¶
2025/06/24 15:15 GTM
Despite their impressive performance, generative image models trained on large-scale datasets frequently fail to produce images with seemingly simple concepts -- e.g., human hands or objects appearing in groups of four -- that are reasonably expected to appear in the training data. These failure modes have largely been documented anecdotally, leaving open the question of whether they reflect idiosyncratic anomalies or more structural limitations of these models. To address this, we introduce a systematic approach for identifying and characterizing “conceptual blindspots” -- concepts present in the training data but absent or misrepresented in a model’s generations. Our method leverages sparse autoencoders (SAEs) to extract interpretable concept embeddings, enabling a quantitative comparison of concept prevalence between real and generated images. We train an archetypal SAE (RA-SAE) on DINOv2 features with 32,000 concepts -- the largest such SAE to date -- enabling fine-grained analysis of conceptual disparities. Applied to four popular generative models (Stable Diffusion 1.5/2.1, PixArt, and Kandinsky), our approach reveals specific suppressed blindspots (e.g., bird feeders, DVD discs, and whitespaces on documents) and exaggerated blindspots (e.g., wood background texture and palm trees). At the individual datapoint level, we further isolate memorization artifacts -- instances where models reproduce highly specific visual templates seen during training. Overall, we propose a theoretically grounded framework for systematically identifying conceptual blindspots in generative models by assessing their conceptual fidelity with respect to the underlying data-generating process.
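The comparison of concept prevalence between real and generated images can be illustrated with a toy prevalence-gap computation over sparse SAE codes. This is only an illustrative statistic on random data; the paper's metric, SAE, and data pipeline differ.

```python
import numpy as np

def concept_prevalence(codes: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Fraction of images in which each SAE concept fires; `codes` has shape
    [n_images, n_concepts] and is assumed sparse."""
    return (codes > threshold).mean(axis=0)

def blindspot_scores(real_codes: np.ndarray, gen_codes: np.ndarray) -> np.ndarray:
    """Positive values: concepts under-represented in generations (suppressed
    blindspots); negative values: over-represented (exaggerated)."""
    return concept_prevalence(real_codes) - concept_prevalence(gen_codes)

rng = np.random.default_rng(0)
real = rng.random((200, 2000)) < 0.02     # toy binary concept activations
gen = rng.random((200, 2000)) < 0.02
print(np.argsort(blindspot_scores(real, gen))[-5:])   # most-suppressed concept ids
```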
3.65Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains?¶
2025/06/24 15:53 GTM
Reinforcement post training (RPT) has recently shown promise in improving the reasoning abilities of large language models (LLMs). However, it remains unclear how well these improvements generalize to new domains, as prior work evaluates RPT models on data from the same domains used for fine-tuning. To understand the generalizability of RPT, we conduct two studies. (1) Observational: We compare a wide range of open-weight RPT models against their corresponding base models across multiple domains, including both seen and unseen domains in their fine-tuning data. (2) Interventional: we fine-tune LLMs with RPT on single domains and evaluate their performance across multiple domains. Both studies converge on the same conclusion that, although RPT brings substantial gains on tasks similar to the fine-tuning data, the gains generalize inconsistently and can vanish on domains with different reasoning patterns.
3.66Self-Supervised Multimodal NeRF for Autonomous Driving¶
2025/06/24 13:32 GTM
In this paper, we propose a Neural Radiance Fields (NeRF) based framework,
referred to as Novel View Synthesis Framework (NVSF). It jointly learns the
implicit neural representation of space and time-varying scene for both LiDAR
and Camera. We test this on a real-world autonomous driving scenario containing
both static and dynamic scenes. Compared to existing multimodal dynamic NeRFs,
our framework is self-supervised, thus eliminating the need for 3D labels. For
efficient training and faster convergence, we introduce heuristic-based image
pixel sampling to focus on pixels with rich information. To preserve the local
features of LiDAR points, a Double Gradient based mask is employed. Extensive
experiments on the KITTI-360 dataset show that, compared to the baseline
models, our framework achieves the best performance in both the LiDAR and
camera domains. Code of the model is available at
https://
3.67EvDetMAV: Generalized MAV Detection from Moving Event Cameras¶
2025/06/24 08:35 GTM
Existing micro aerial vehicle (MAV) detection methods mainly rely on the
target’s appearance features in RGB images, whose diversity makes it difficult
to achieve generalized MAV detection. We notice that different types of MAVs
share the same distinctive features in event streams due to their high-speed
rotating propellers, which are hard to see in RGB images. This paper studies
how to detect different types of MAVs from an event camera by fully exploiting
the features of propellers in the original event stream. The proposed method
consists of three modules to extract the salient and spatio-temporal features
of the propellers while filtering out noise from background objects and camera
motion. Since there are no existing event-based MAV datasets, we introduce a
novel MAV dataset for the community. This is the first event-based MAV dataset
comprising multiple scenarios and different types of MAVs. Without training,
our method significantly outperforms state-of-the-art methods and can deal with
challenging scenarios, achieving a precision rate of 83.0% (+30.3%) and a
recall rate of 81.5% (+36.4%) on the proposed testing dataset. The dataset
and code are available at: https://
3.69NEAR: A Nested Embedding Approach to Efficient Product Retrieval and Ranking¶
2025/06/24 16:02 GTM
E-commerce information retrieval (IR) systems struggle to simultaneously achieve high accuracy in interpreting complex user queries and maintain efficient processing of vast product catalogs. The dual challenge lies in precisely matching user intent with relevant products while managing the computational demands of real-time search across massive inventories. In this paper, we propose a Nested Embedding Approach to product Retrieval and Ranking, called NEAR, which can achieve up to a 12-fold reduction in embedding size at inference time while introducing no extra training cost and improving accuracy for various encoder-based Transformer models. We validate our approach using different loss functions for the retrieval and ranking task, including multiple negative ranking loss and online contrastive loss, on four different test sets with various IR challenges such as short and implicit queries. Our approach achieves improved performance at smaller embedding dimensions compared to existing models.
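The generic nested-embedding idea behind this kind of inference-time saving is easy to illustrate: truncate a trained embedding to its leading coordinates and re-normalize before scoring. The dimensions below are illustrative; NEAR's training objectives are described in the paper.

```python
import torch
import torch.nn.functional as F

def truncate_embedding(emb: torch.Tensor, dim: int) -> torch.Tensor:
    """Keep the first `dim` coordinates and re-normalize, so one trained
    embedding can be served at several sizes at inference time."""
    return F.normalize(emb[..., :dim], dim=-1)

query = F.normalize(torch.randn(1, 768), dim=-1)
products = F.normalize(torch.randn(1000, 768), dim=-1)
q64, p64 = truncate_embedding(query, 64), truncate_embedding(products, 64)  # 12x smaller
scores = (q64 @ p64.T).squeeze(0)
print(scores.topk(5).indices)    # top-5 candidates under the truncated embeddings
```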
3.70One Prototype Is Enough: Single-Prototype Activation for Interpretable Image Classification¶
2025/06/24 17:18 GTM
In this paper, we propose ProtoSolo, a novel deep neural architecture for
interpretable image classification inspired by prototypical networks such as
ProtoPNet. Existing prototype networks usually rely on the collaborative
decision-making of multiple prototypes to achieve the classification and
interpretation of a single category. In contrast, ProtoSolo only requires the
activation of a single prototype to complete the classification. This allows
the network to explain each category decision by only providing the features
that are most similar to the prototype of that category, significantly reducing
the cognitive complexity of the explanation. Secondly, we propose a
feature-based comparison method, which uses feature maps instead of
full-channel feature vectors as the objects of similarity comparison and prototype learning.
This design enables ProtoSolo to utilize richer global information for
classification while relying on a single prototype activation. In addition, we
propose a non-prototype projection learning strategy, which preserves the
information association between the prototype and the training image patches
while avoiding the sharp change of the network structure caused by the
projection operation, thus avoiding its negative impact on the classification
performance. Experiments on the CUB-200-2011 and Stanford Cars datasets show
that ProtoSolo achieves superior performance in classification tasks and
reaches the best level in terms of cognitive complexity of explanations
compared to state-of-the-art interpretable methods. The code is available at
https://
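The single-prototype decision rule can be pictured with a toy classifier that picks the one most-activated prototype and reports its activation as the explanation. Shapes, the one-prototype-per-class setup, and the cosine-similarity choice are assumptions for illustration, not ProtoSolo's architecture.

```python
import torch
import torch.nn.functional as F

def single_prototype_classify(feature_map: torch.Tensor, prototypes: torch.Tensor):
    """Pick the single most-activated prototype; the winning activation is
    both the prediction and the explanation."""
    f = F.normalize(feature_map.flatten(), dim=0)
    p = F.normalize(prototypes.flatten(1), dim=1)    # [n_classes, d]
    activations = p @ f                              # one similarity per class
    cls = int(activations.argmax())
    return cls, activations[cls].item()

feat = torch.randn(32, 7, 7)             # feature map from a backbone
protos = torch.randn(200, 32, 7, 7)      # one prototype per class (e.g., CUB-200)
print(single_prototype_classify(feat, protos))
```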
3.71NaviAgent: Bilevel Planning on Tool Dependency Graphs for Function Calling¶
2025/06/24 10:39 GTM
LLMs’ reliance on static knowledge and fragile tool invocation severely hinders the orchestration of complex, heterogeneous toolchains, particularly at large scales. Existing methods typically use rigid single-path execution, resulting in poor error recovery and exponentially growing search spaces. We introduce NaviAgent, a graph-navigated bilevel planning architecture for robust function calling, comprising a Multi-Path Decider and Graph-Encoded Navigator. As an LLM-powered agent, the Multi-Path Decider defines a four-dimensional decision space and continuously perceives environmental states, dynamically selecting the optimal action to fully cover all tool invocation scenarios. The Graph-Encoded Navigator constructs a Tool Dependency Heterogeneous Graph (TDHG), where node embeddings explicitly fuse API schema structure with historical invocation behavior. It also integrates a novel heuristic search strategy that guides the Decider toward efficient and highly successful toolchains, even for unseen tool combinations. Experiments show that NaviAgent consistently achieves the highest task success rate (TSR) across all foundation models and task complexities, outperforming the average baselines (ReAct, ToolLLM, α-UMI) by 13.5%, 16.4%, and 19.0% on Qwen2.5-14B, Qwen2.5-32B, and Deepseek-V3, respectively. Its execution steps are typically within one step of the most efficient baseline, ensuring a strong balance between quality and efficiency. Notably, a fine-tuned Qwen2.5-14B model achieves a TSR of 49.5%, surpassing the much larger 32B model (44.9%) under our architecture. Incorporating the Graph-Encoded Navigator further boosts TSR by an average of 2.4 points, with gains of over 9 points on complex tasks for larger models (Deepseek-V3 and GPT-4o), highlighting its essential role in toolchain orchestration.
3.72TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems¶
2025/06/24 09:12 GTM
Evaluation of Text to Speech (TTS) systems is challenging and resource-intensive. Subjective metrics such as Mean Opinion Score (MOS) are not easily comparable between works. Objective metrics are frequently used, but rarely validated against subjective ones. Both kinds of metrics are challenged by recent TTS systems capable of producing synthetic speech indistinguishable from real speech. In this work, we introduce Text to Speech Distribution Score 2 (TTSDS2), a more robust and improved version of TTSDS. Across a range of domains and languages, it is the only one out of 16 compared metrics to correlate with a Spearman correlation above 0.50 for every domain and subjective score evaluated. We also release a range of resources for evaluating synthetic speech close to real speech: A dataset with over 11,000 subjective opinion score ratings; a pipeline for continually recreating a multilingual test dataset to avoid data leakage; and a continually updated benchmark for TTS in 14 languages.
3.73Generate the Forest before the Trees -- A Hierarchical Diffusion model for Climate Downscaling¶
2025/06/24 07:39 GTM
Downscaling is essential for generating the high-resolution climate data
needed for local planning, but traditional methods remain computationally
demanding. Recent years have seen impressive results from AI downscaling
models, particularly diffusion models, which have attracted attention due to
their ability to generate ensembles and overcome the smoothing problem common
in other AI methods. However, these models typically remain computationally
intensive. We introduce a Hierarchical Diffusion Downscaling (HDD) model, which
introduces an easily-extensible hierarchical sampling process to the diffusion
framework. A coarse-to-fine hierarchy is imposed via a simple downsampling
scheme. HDD achieves competitive accuracy on ERA5 reanalysis datasets and CMIP6
models, significantly reducing computational load by running on up to half as
many pixels with competitive results. Additionally, a single model trained at
0.25° resolution transfers seamlessly across multiple CMIP6 models with
much coarser resolution. HDD thus offers a lightweight alternative for
probabilistic climate downscaling, facilitating affordable large-ensemble
high-resolution climate projections. See a full code implementation at:
https://
3.74KnowMap: Efficient Knowledge-Driven Task Adaptation for LLMs¶
2025/06/24 11:30 GTM
While Large Language Models (LLMs) possess significant capabilities in open-world agent tasks, they also face challenges in rapidly adapting to new, specialized tasks due to their reliance on static pre-trained knowledge. Traditional methods such as fine-tuning are often costly, data-intensive, and may lead to “catastrophic forgetting.” Therefore, we present KnowMap, a novel approach that dynamically constructs a knowledge base from environmental and experiential data. KnowMap fine-tunes a small knowledge-embedding model to equip a larger LLM with valuable task-specific knowledge. Our experiments on the ScienceWorld benchmark demonstrate a 17.71% improvement in the performance of the gpt-4-turbo model. KnowMap not only provides an efficient and effective means of LLM task adaptation, but also highlights how integrating environmental and experiential knowledge can enhance LLMs’ reasoning capabilities.
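The "small embedder guides a large LLM" pattern can be sketched as simple embedding-based retrieval whose results are prepended to the large model's prompt. The snippets, embedding model, and prompt format below are purely illustrative assumptions, not KnowMap's pipeline.

```python
import numpy as np

def retrieve_knowledge(query_emb: np.ndarray, kb_embs: np.ndarray,
                       kb_texts: list, k: int = 3) -> list:
    """Return the k knowledge snippets closest to the query under cosine
    similarity; the snippets would then be prepended to the large model's prompt."""
    sims = kb_embs @ query_emb / (
        np.linalg.norm(kb_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    return [kb_texts[i] for i in np.argsort(-sims)[:k]]

kb_texts = ["the stove must be activated before boiling water",
            "metal objects conduct electricity",
            "seeds need water and light to grow"]
kb_embs = np.random.randn(3, 128)     # stand-in for small-model embeddings
print(retrieve_knowledge(np.random.randn(128), kb_embs, kb_texts, k=2))
```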
3.75Ground-Effect-Aware Modeling and Control for Multicopters¶
2025/06/24 08:43 GTM
The ground effect on multicopters introduces several challenges, such as
control errors caused by additional lift, oscillations that may occur during
near-ground flight due to external torques, and the influence of ground airflow
on models such as the rotor drag and the mixing matrix. This article collects
and analyzes the dynamics data of near-ground multicopter flight through
various methods, including force measurement platforms and real-world flights.
For the first time, we summarize the mathematical model of the external torque
of multicopters under ground effect. The influence of ground airflow on rotor
drag and the mixing matrix is also verified through adequate experimentation
and analysis. Through simplification and derivation, the differential flatness
of the multicopter’s dynamic model under ground effect is confirmed. To
mitigate the influence of these disturbance models on control, we propose a
control method that combines dynamic inverse and disturbance models, ensuring
consistent control effectiveness at both high and low altitudes. In this
method, the additional thrust and variations in rotor drag under ground effect
are both considered and compensated through feedforward models. The leveling
torque of ground effect can be equivalently represented as variations in the
center of gravity and the moment of inertia. In this way, the leveling torque
does not explicitly appear in the dynamic model. The final experimental results
show that the method proposed in this paper reduces the control error (RMSE) by
45.3%. Please check the supplementary material at:
https://
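For background, the classical single-rotor ground-effect relation (Cheeseman-Bennett) captures the extra thrust near the ground; the article's multicopter-specific torque and drag models go beyond this simple formula, so the relation below is context rather than the paper's model.

```latex
\frac{T_{\mathrm{IGE}}}{T_{\mathrm{OGE}}} = \frac{1}{1 - \left( \frac{R}{4z} \right)^{2}}
```

Here R is the rotor radius and z the rotor height above ground, with the approximation usually considered valid roughly for z > R/2.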
3.76A Survey on Soft Robot Adaptability: Implementations, Applications, and Prospects¶
2025/06/24 07:53 GTM
Soft robots, compared to rigid robots, possess inherent advantages, including higher degrees of freedom, compliance, and enhanced safety, which have contributed to their increasing application across various fields. Among these benefits, adaptability is particularly noteworthy. In this paper, adaptability in soft robots is categorized into external and internal adaptability. External adaptability refers to the robot’s ability to adjust, either passively or actively, to variations in environments, object properties, geometries, and task dynamics. Internal adaptability refers to the robot’s ability to cope with internal variations, such as manufacturing tolerances or material aging, and to generalize control strategies across different robots. As the field of soft robotics continues to evolve, the significance of adaptability has become increasingly pronounced. In this review, we summarize various approaches to enhancing the adaptability of soft robots, including design, sensing, and control strategies. Additionally, we assess the impact of adaptability on applications such as surgery, wearable devices, locomotion, and manipulation. We also discuss the limitations of soft robotics adaptability and prospective directions for future research. By analyzing adaptability through the lenses of implementation, application, and challenges, this paper aims to provide a comprehensive understanding of this essential characteristic in soft robotics and its implications for diverse applications.
3.77Accurate, fast, cheap: Choose three. Replacing Multi-Head-Attention with Bidirectional Recurrent Attention for Long-Form ASR¶
2025/06/24 16:21 GTM
Long-form speech recognition is an application area of increasing research focus. ASR models based on multi-head attention (MHA) are ill-suited to long-form ASR because of their quadratic complexity in sequence length. We build on recent work that has investigated linear complexity recurrent attention (RA) layers for ASR. We find that bidirectional RA layers can match the accuracy of MHA for both short- and long-form applications. We present a strong limited-context attention (LCA) baseline, and show that RA layers are just as accurate while being more efficient. We develop a long-form training paradigm which further improves RA performance, leading to better accuracy than LCA with 44% higher throughput. We also present Direction Dropout, a novel regularization method that improves accuracy, provides fine-grained control of the accuracy/throughput trade-off of bidirectional RA, and enables a new alternating directions decoding mode with even higher throughput.
3.78Correcting Hallucinations in News Summaries: Exploration of Self-Correcting LLM Methods with External Knowledge¶
2025/06/24 13:20 GTM
While large language models (LLMs) have shown remarkable capabilities to generate coherent text, they suffer from the issue of hallucinations -- factually inaccurate statements. Among numerous approaches to tackle hallucinations, especially promising are the self-correcting methods. They leverage the multi-turn nature of LLMs to iteratively generate verification questions requesting additional evidence, answer them with internal or external knowledge, and use that to refine the original response with the new corrections. These methods have been explored for encyclopedic generation, but less so for domains like news summarization. In this work, we investigate two state-of-the-art self-correcting systems by applying them to correct hallucinated summaries using evidence from three search engines. We analyze the results and provide insights into the systems’ performance, revealing interesting practical findings on the benefits of search engine snippets and few-shot prompts, as well as high alignment of G-Eval and human evaluation.
3.79Evaluating Compliance with Visualization Guidelines in Diagrams for Scientific Publications Using Large Vision Language Models¶
2025/06/24 17:42 GTM
Diagrams are widely used to visualize data in publications. The research field of data visualization deals with defining principles and guidelines for the creation and use of these diagrams, which are often not known or adhered to by researchers, leading to misinformation caused by providing inaccurate or incomplete information. In this work, large Vision Language Models (VLMs) are used to analyze diagrams in order to identify potential problems in regards to selected data visualization principles and guidelines. To determine the suitability of VLMs for these tasks, five open source VLMs and five prompting strategies are compared using a set of questions derived from selected data visualization guidelines. The results show that the employed VLMs work well to accurately analyze diagram types (F1-score 82.49 %), 3D effects (F1-score 98.55 %), axes labels (F1-score 76.74 %), lines (RMSE 1.16), colors (RMSE 1.60) and legends (F1-score 96.64 %, RMSE 0.70), while they cannot reliably provide feedback about the image quality (F1-score 0.74 %) and tick marks/labels (F1-score 46.13 %). Among the employed VLMs, Qwen2.5VL performs best, and the summarizing prompting strategy performs best for most of the experimental questions. It is shown that VLMs can be used to automatically identify a number of potential issues in diagrams, such as missing axes labels, missing legends, and unnecessary 3D effects. The approach laid out in this work can be extended for further aspects of data visualization.
3.80AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models¶
2025/06/24 10:45 GTM
Quantization has emerged as an effective and lightweight solution to reduce the memory footprint of the KV cache in Large Language Models (LLMs). Nevertheless, minimizing the performance degradation caused by ultra-low-bit KV cache quantization remains a significant challenge. We observe that quantizing the KV cache of different tokens has varying impacts on the quality of attention outputs. To systematically investigate this phenomenon, we perform forward error propagation analysis on attention and propose the Anchor Score (AnS) that quantifies the sensitivity of each token’s KV cache to quantization-induced error. Our analysis reveals significant disparities in AnS across tokens, suggesting that preserving a small subset with full precision (FP16) of high-AnS tokens can greatly mitigate accuracy loss in aggressive quantization scenarios. Based on this insight, we introduce AnTKV, a novel framework that leverages Anchor Token-aware Vector Quantization to compress the KV cache. Furthermore, to support efficient deployment, we design and develop a triton kernel that is fully compatible with FlashAttention, enabling fast online Anchor Token selection. AnTKV enables LLaMA-3-8B to handle context lengths up to 840K tokens on a single 80GB A100 GPU, while achieving up to 3.5x higher decoding throughput compared to the FP16 baseline. Our experiment results demonstrate that AnTKV matches or outperforms prior works such as KIVI, SKVQ, KVQuant, and CQ under 4-bit settings. More importantly, AnTKV achieves significantly lower perplexity under ultra-low-bit quantization on Mistral-7B, with only 6.32 at 1-bit and 8.87 at 0.375-bit, compared to the FP16 baseline of 4.73.
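The mixed-precision split described above (a small set of high-sensitivity tokens kept in FP16, the rest quantized) can be illustrated with the toy sketch below. Coarse rounding stands in for the paper's learned vector-quantization codebooks, and the score values and fractions are illustrative.

```python
import torch

def split_kv_by_anchor_score(keys, values, anchor_scores, fp16_fraction=0.02):
    """Keep the fraction of tokens with the highest anchor scores in full
    precision and coarsely round the rest as a stand-in for vector quantization."""
    n = keys.size(0)
    n_full = max(1, int(fp16_fraction * n))
    keep = anchor_scores.topk(n_full).indices
    full_mask = torch.zeros(n, dtype=torch.bool)
    full_mask[keep] = True
    q_keys, q_vals = keys.clone(), values.clone()
    q_keys[~full_mask] = (keys[~full_mask] * 4).round() / 4   # crude quantizer
    q_vals[~full_mask] = (values[~full_mask] * 4).round() / 4
    return q_keys, q_vals, full_mask

k, v = torch.randn(512, 64), torch.randn(512, 64)
scores = torch.rand(512)                 # stand-in for per-token Anchor Scores
qk, qv, mask = split_kv_by_anchor_score(k, v, scores)
print(int(mask.sum()))                   # tokens kept in full precision
```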
3.81ECCoT: A Framework for Enhancing Effective Cognition via Chain of Thought in Large Language Model¶
2025/06/24 13:09 GTM
In the era of large-scale artificial intelligence, Large Language Models (LLMs) have made significant strides in natural language processing. However, they often lack transparency and generate unreliable outputs, raising concerns about their interpretability. To address this, the Chain of Thought (CoT) prompting method structures reasoning into step-by-step deductions. Yet, not all reasoning chains are valid, and errors can lead to unreliable conclusions. We propose ECCoT, an End-to-End Cognitive Chain of Thought Validation Framework, to evaluate and refine reasoning chains in LLMs. ECCoT integrates the Markov Random Field-Embedded Topic Model (MRF-ETM) for topic-aware CoT generation and Causal Sentence-BERT (CSBert) for causal reasoning alignment. By filtering ineffective chains using structured ordering statistics, ECCoT improves interpretability, reduces biases, and enhances the trustworthiness of LLM-based decision-making. Key contributions include the introduction of ECCoT, MRF-ETM for topic-driven CoT generation, and CSBert for causal reasoning enhancement. Code is released at: https://
3.82LLM-Based Social Simulations Require a Boundary¶
2025/06/24 17:14 GTM
This position paper argues that large language model (LLM)-based social simulations should establish clear boundaries to meaningfully contribute to social science research. While LLMs offer promising capabilities for modeling human-like agents compared to traditional agent-based modeling, they face fundamental limitations that constrain their reliability for social pattern discovery. The core issue lies in LLMs’ tendency towards an “average persona” that lacks sufficient behavioral heterogeneity, a critical requirement for simulating complex social dynamics. We examine three key boundary problems: alignment (simulated behaviors matching real-world patterns), consistency (maintaining coherent agent behavior over time), and robustness (reproducibility under varying conditions). We propose heuristic boundaries for determining when LLM-based simulations can reliably advance social science understanding. We believe that these simulations are more valuable when focusing on (1) collective patterns rather than individual trajectories, (2) agent behaviors aligning with real population averages despite limited variance, and (3) proper validation methods available for testing simulation robustness. We provide a practical checklist to guide researchers in determining the appropriate scope and claims for LLM-based social simulations.
3.83ReactEMG: Zero-Shot, Low-Latency Intent Detection via sEMG¶
2025/06/24 17:28 GTM
Surface electromyography (sEMG) signals show promise for effective human-computer interfaces, particularly in rehabilitation and prosthetics. However, challenges remain in developing systems that respond quickly and reliably to user intent, across different subjects and without requiring time-consuming calibration. In this work, we propose a framework for EMG-based intent detection that addresses these challenges. Unlike traditional gesture recognition models that wait until a gesture is completed before classifying it, our approach uses a segmentation strategy to assign intent labels at every timestep as the gesture unfolds. We introduce a novel masked modeling strategy that aligns muscle activations with their corresponding user intents, enabling rapid onset detection and stable tracking of ongoing gestures. In evaluations against baseline methods, considering both accuracy and stability for device control, our approach surpasses state-of-the-art performance in zero-shot transfer conditions, demonstrating its potential for wearable robotics and next-generation prosthetic systems. Our project page is available at: https://
3.84Measuring and Guiding Monosemanticity¶
2025/06/24 07:18 GTM
There is growing interest in leveraging mechanistic interpretability and controllability to better understand and influence the internal dynamics of large language models (LLMs). However, current methods face fundamental challenges in reliably localizing and manipulating feature representations. Sparse Autoencoders (SAEs) have recently emerged as a promising direction for feature extraction at scale, yet they, too, are limited by incomplete feature isolation and unreliable monosemanticity. To systematically quantify these limitations, we introduce Feature Monosemanticity Score (FMS), a novel metric to quantify feature monosemanticity in latent representation. Building on these insights, we propose Guided Sparse Autoencoders (G-SAE), a method that conditions latent representations on labeled concepts during training. We demonstrate that reliable localization and disentanglement of target concepts within the latent space improve interpretability, detection of behavior, and control. Specifically, our evaluations on toxicity detection, writing style identification, and privacy attribute recognition show that G-SAE not only enhances monosemanticity but also enables more effective and fine-grained steering with less quality degradation. Our findings provide actionable guidelines for measuring and advancing mechanistic interpretability and control of LLMs.
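The summary leaves the exact G-SAE objective and the FMS definition unspecified; as a loose illustration of "conditioning latent representations on labeled concepts during training", the sketch below adds a supervised term that ties a few designated SAE latents to concept labels (all names and the loss weighting are illustrative, not the paper's method):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedSAE(nn.Module):
    """Toy sparse autoencoder whose first `n_concepts` latents are encouraged
    to track labeled concepts; an illustration of concept-conditioned training,
    not the paper's exact G-SAE objective."""
    def __init__(self, d_model, d_latent, n_concepts):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)
        self.n_concepts = n_concepts

    def forward(self, h):
        pre = self.enc(h)        # pre-activation latents (used as concept logits)
        z = F.relu(pre)          # sparse code
        return self.dec(z), z, pre

def guided_sae_loss(model, h, concept_labels, l1=1e-3, guide=1.0):
    h_hat, z, pre = model(h)
    recon = F.mse_loss(h_hat, h)              # reconstruct the activation
    sparsity = z.abs().mean()                 # L1 sparsity on the code
    # Supervise the first n_concepts latents with binary concept labels.
    guided = F.binary_cross_entropy_with_logits(pre[:, :model.n_concepts], concept_labels)
    return recon + l1 * sparsity + guide * guided

sae = GuidedSAE(d_model=768, d_latent=4096, n_concepts=3)
acts = torch.randn(16, 768)                   # hypothetical LLM activations
labels = torch.randint(0, 2, (16, 3)).float() # e.g. toxicity / style / privacy flags
print(guided_sae_loss(sae, acts, labels).item())
```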
3.85SoK: Can Synthetic Images Replace Real Data? A Survey of Utility and Privacy of Synthetic Image Generation¶
2025/06/24 06:41 GTM
Advances in generative models have transformed the field of synthetic image generation for privacy-preserving data synthesis (PPDS). However, the field lacks a comprehensive survey and comparison of synthetic image generation methods across diverse settings. In particular, when we generate synthetic images for the purpose of training a classifier, there is a pipeline of generation-sampling-classification which takes the private training data as input and outputs the final classifier of interest. In this survey, we systematically categorize existing image synthesis methods, privacy attacks, and mitigations along this generation-sampling-classification pipeline. To empirically compare diverse synthesis approaches, we provide a benchmark with representative generative methods and use model-agnostic membership inference attacks (MIAs) as a measure of privacy risk. Through this study, we seek to answer critical questions in PPDS: Can synthetic data effectively replace real data? Which release strategy balances utility and privacy? Do mitigations improve the utility-privacy tradeoff? Which generative models perform best across different scenarios? With a systematic evaluation of diverse methods, our study provides actionable insights into the utility-privacy tradeoffs of synthetic data generation methods and guides the decision on optimal data releasing strategies for real-world applications.
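As a reference point for how membership inference can serve as a privacy measure in such a pipeline, here is a minimal loss-threshold attack; the survey's actual model-agnostic MIAs are more sophisticated, and the losses below are synthetic:

```python
import numpy as np

def loss_threshold_mia(member_losses, nonmember_losses):
    """Minimal loss-threshold membership inference attack: training members tend
    to have lower loss, so sweep a threshold and report the best attack accuracy.
    A crude stand-in for the model-agnostic MIAs used as a privacy metric."""
    losses = np.concatenate([member_losses, nonmember_losses])
    labels = np.concatenate([np.ones(member_losses.size), np.zeros(nonmember_losses.size)])
    best_acc = 0.0
    for t in np.quantile(losses, np.linspace(0.01, 0.99, 99)):
        preds = (losses < t).astype(float)     # low loss => predicted "member"
        best_acc = max(best_acc, float((preds == labels).mean()))
    return best_acc

# Hypothetical per-example losses of the final classifier.
members = np.random.gamma(1.0, 0.3, size=1000)
nonmembers = np.random.gamma(1.5, 0.5, size=1000)
print(f"attack accuracy ~= {loss_threshold_mia(members, nonmembers):.2f}")
```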
3.86Social Hatred: Efficient Multimodal Detection of Hatemongers¶
2025/06/24 13:16 GTM
Automatic detection of online hate speech serves as a crucial step in the detoxification of online discourse. Moreover, accurate classification can promote a better understanding of the proliferation of hate as a social phenomenon. While most prior work focuses on the detection of hateful utterances, we argue that focusing on the user level is as important, albeit challenging. In this paper we consider a multimodal aggregative approach for the detection of hatemongers, taking into account the potentially hateful texts, user activity, and the user network. Evaluating our method on three unique datasets, X (Twitter), Gab, and Parler, we show that processing a user’s texts in her social context significantly improves the detection of hatemongers, compared to previously used text- and graph-based methods. We offer a comprehensive set of results obtained in different experimental settings as well as a qualitative analysis of illustrative cases. Our method can be used to improve the classification of coded messages, dog-whistling, and racial gaslighting, as well as to inform intervention measures. Moreover, we demonstrate that our multimodal approach performs well across very different content platforms and over large datasets and networks.
3.87Online camera-pose-free stereo endoscopic tissue deformation recovery with tissue-invariant vision-biomechanics consistency¶
2025/06/24 07:32 GTM
Tissue deformation recovery based on stereo endoscopic images is crucial for tool-tissue interaction analysis and benefits surgical navigation and autonomous soft tissue manipulation. Previous research suffers from the problems raised from camera motion, occlusion, large tissue deformation, lack of tissue-specific biomechanical priors, and reliance on offline processing. Unlike previous studies where the tissue geometry and deformation are represented by 3D points and displacements, the proposed method models tissue geometry as the 3D point and derivative map and tissue deformation as the 3D displacement and local deformation map. For a single surface point, 6 parameters are used to describe its rigid motion and 3 parameters for its local deformation. The method is formulated under the camera-centric setting, where all motions are regarded as the scene motion with respect to the camera. Inter-frame alignment is realized by optimizing the inter-frame deformation, making it unnecessary to estimate camera pose. The concept of the canonical map is introduced to optimize tissue geometry and deformation in an online approach. Quantitative and qualitative experiments were conducted using in vivo and ex vivo laparoscopic datasets. With the inputs of depth and optical flow, the method stably models tissue geometry and deformation even when the tissue is partially occluded or moving outside the field of view. Results show that the 3D reconstruction accuracy in the non-occluded and occluded areas reaches 0.37±0.27 mm and 0.39±0.21 mm in terms of surface distance, respectively. The method can also estimate surface strain distribution during various manipulations as an extra modality for mechanical-based analysis.
3.88A Comparative Study of NAFNet Baselines for Image Restoration¶
2025/06/24 17:59 GTM
We study NAFNet (Nonlinear Activation Free Network), a simple and efficient deep learning baseline for image restoration. By using CIFAR10 images corrupted with noise and blur, we conduct an ablation study of NAFNet’s core components. Our baseline model implements SimpleGate activation, Simplified Channel Attention (SCA), and Layer Normalization. We compare this baseline to different variants that replace or remove components. Quantitative results (PSNR, SSIM) and examples illustrate how each modification affects restoration performance. Our findings support the NAFNet design: the SimpleGate and simplified attention mechanisms yield better results than conventional activations and attention, while LayerNorm proves to be important for stable training. We conclude with recommendations for model design and discuss potential improvements and future work.
3.89MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages¶
2025/06/24 09:53 GTM
Multilingual large language models (LLMs) are advancing rapidly, with new models frequently claiming support for an increasing number of languages. However, existing evaluation datasets are limited and lack cross-lingual alignment, leaving assessments of multilingual capabilities fragmented in both language and skill coverage. To address this, we introduce MuBench, a benchmark covering 61 languages and evaluating a broad range of capabilities. We evaluate several state-of-the-art multilingual LLMs and find notable gaps between claimed and actual language coverage, particularly a persistent performance disparity between English and low-resource languages. Leveraging MuBench’s alignment, we propose Multilingual Consistency (MLC) as a complementary metric to accuracy for analyzing performance bottlenecks and guiding model improvement. Finally, we pretrain a suite of 1.2B-parameter models on English and Chinese with 500B tokens, varying language ratios and parallel data proportions to investigate cross-lingual transfer dynamics.
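The digest does not give MuBench's formal definition of Multilingual Consistency; one plausible reading, sketched below purely for illustration, scores the fraction of aligned items that receive the same answer across language pairs:

```python
from itertools import combinations

def multilingual_consistency(answers_by_lang):
    """Toy consistency score: fraction of aligned items that receive the same
    answer across every pair of languages. An illustrative reading of MuBench's
    MLC metric, not its official definition."""
    langs = list(answers_by_lang)
    n_items = len(answers_by_lang[langs[0]])
    agree, total = 0, 0
    for a, b in combinations(langs, 2):
        for i in range(n_items):
            agree += answers_by_lang[a][i] == answers_by_lang[b][i]
            total += 1
    return agree / total

preds = {"en": ["A", "B", "C", "D"],
         "zh": ["A", "B", "D", "D"],
         "sw": ["A", "C", "C", "D"]}
print(multilingual_consistency(preds))  # ~0.67 for this toy example
```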
3.90Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress¶
2025/06/24 12:35 GTM
In Machine Translation (MT) evaluation, metric performance is assessed based on agreement with human judgments. In recent years, automatic metrics have demonstrated increasingly high levels of agreement with humans. To gain a clearer understanding of metric performance and establish an upper bound, we incorporate human baselines in the MT meta-evaluation, that is, the assessment of MT metrics’ capabilities. Our results show that human annotators are not consistently superior to automatic metrics, with state-of-the-art metrics often ranking on par with or higher than human baselines. Despite these findings suggesting human parity, we discuss several reasons for caution. Finally, we explore the broader implications of our results for the research field, asking: Can we still reliably measure improvements in MT evaluation? With this work, we aim to shed light on the limits of our ability to measure progress in the field, fostering discussion on an issue that we believe is crucial to the entire MT evaluation community.
3.91Stylized Structural Patterns for Improved Neural Network Pre-training¶
2025/06/24 09:47 GTM
Modern deep learning models in computer vision require large datasets of real images, which are difficult to curate and pose privacy and legal concerns, limiting their commercial use. Recent works suggest synthetic data as an alternative, yet models trained with it often underperform. This paper proposes a two-step approach to bridge this gap. First, we propose an improved neural fractal formulation through which we introduce a new class of synthetic data. Second, we propose reverse stylization, a technique that transfers visual features from a small, license-free set of real images onto synthetic datasets, enhancing their effectiveness. We analyze the domain gap between our synthetic datasets and real images using Kernel Inception Distance (KID) and show that our method achieves a significantly lower distributional gap compared to existing synthetic datasets. Furthermore, our experiments across different tasks demonstrate the practical impact of this reduced gap. We show that pretraining the EDM2 diffusion model on our synthetic dataset leads to an 11% reduction in FID during image generation, compared to models trained on existing synthetic datasets, and a 20% decrease in autoencoder reconstruction error, indicating improved performance in data representation. Furthermore, a ViT-S model trained for classification on this synthetic data achieves over a 10% improvement in ImageNet-100 accuracy. Our work opens up exciting possibilities for training practical models when sufficiently large real training sets are not available.
3.92A Verification Methodology for Safety Assurance of Robotic Autonomous Systems¶
2025/06/24 13:39 GTM
Autonomous robots deployed in shared human environments, such as agricultural settings, require rigorous safety assurance to meet both functional reliability and regulatory compliance. These systems must operate in dynamic, unstructured environments, interact safely with humans, and respond effectively to a wide range of potential hazards. This paper presents a verification workflow for the safety assurance of an autonomous agricultural robot, covering the entire development life-cycle, from concept study and design to runtime verification. The outlined methodology begins with a systematic hazard analysis and risk assessment to identify potential risks and derive corresponding safety requirements. A formal model of the safety controller is then developed to capture its behaviour and verify that the controller satisfies the specified safety properties with respect to these requirements. The proposed approach is demonstrated on a field robot operating in an agricultural setting. The results show that the methodology can be effectively used to verify safety-critical properties and facilitate the early identification of design issues, contributing to the development of safer robots and autonomous systems.
3.93Learning to Disentangle Latent Reasoning Rules with Language VAEs: A Systematic Study¶
2025/06/24 08:38 GTM
Incorporating explicit reasoning rules within the latent space of language models (LMs) offers a promising pathway to enhance generalisation, interpretability, and controllability. While current Transformer-based language models have shown strong performance on Natural Language Inference (NLI) tasks, they often rely on memorisation rather than rule-based inference. This work investigates how reasoning rules can be explicitly embedded and memorised within LMs through Language Variational Autoencoders (VAEs). We propose a complete pipeline for learning reasoning rules within Transformer-based language VAEs. This pipeline encompasses three rule-based reasoning tasks, a supporting theoretical framework, and a practical end-to-end architecture. The experiments illustrate the following findings: Disentangled reasoning: under explicit signal supervision, reasoning rules - viewed as functional mappings - can be disentangled within the encoder’s parametric space. This separation results in distinct clustering of rules in the output feature space. Prior knowledge injection: injecting reasoning information into the Query enables the model to more effectively retrieve the stored Value from memory based on the Key. This approach offers a simple method for integrating prior knowledge into decoder-only language models. Performance bottleneck: in mathematical reasoning tasks using Qwen2.5 (0.5B), increasing the sample count does not improve performance beyond a certain point. Moreover, FFN layers are better than attention layers at preserving the separation of reasoning rules in the model’s parameters.
3.94Estimating Spatially-Dependent GPS Errors Using a Swarm of Robots¶
2025/06/24 15:19 GTM
External factors, including urban canyons and adversarial interference, can lead to Global Positioning System (GPS) inaccuracies that vary as a function of the position in the environment. This study addresses the challenge of estimating a static, spatially-varying error function using a team of robots. We introduce a State Bias Estimation Algorithm (SBE) whose purpose is to estimate the GPS biases. The central idea is to use sensed estimates of the range and bearing to the other robots in the team to estimate changes in bias across the environment. A set of drones moves in a 2D environment, each sampling data from GPS, range, and bearing sensors. The biases calculated by the SBE at estimated positions are used to train a Gaussian Process Regression (GPR) model. We use a Sparse Gaussian process-based Informative Path Planning (IPP) algorithm that identifies high-value regions of the environment for data collection. The swarm plans paths that maximize information gain in each iteration, further refining their understanding of the environment’s positional bias landscape. We evaluated SBE and IPP in simulation and compared the IPP methodology to an open-loop strategy.
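A minimal sketch of the bias-field regression step, assuming the SBE stage has already produced per-position bias estimates (the positions and bias values below are synthetic); the predictive uncertainty is what an informative path planner would exploit:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical output of the SBE step: estimated 2D positions and the GPS bias
# (per axis) inferred at each of them from inter-robot range/bearing sensing.
positions = np.random.uniform(0, 100, size=(40, 2))
bias_xy = np.stack([np.sin(positions[:, 0] / 20.0),
                    np.cos(positions[:, 1] / 25.0)], axis=1)  # synthetic bias field

# Fit one GPR per bias component to predict the bias anywhere in the workspace.
kernel = RBF(length_scale=15.0) + WhiteKernel(noise_level=0.05)
gpr_x = GaussianProcessRegressor(kernel=kernel).fit(positions, bias_xy[:, 0])
gpr_y = GaussianProcessRegressor(kernel=kernel).fit(positions, bias_xy[:, 1])

query = np.array([[50.0, 50.0]])
mean_x, std_x = gpr_x.predict(query, return_std=True)
print(f"estimated x-bias at (50, 50): {mean_x[0]:.3f} +/- {std_x[0]:.3f}")
# The predictive std can drive the informative path planner toward uncertain regions.
```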
3.95Can Large Language Models Capture Human Annotator Disagreements?¶
2025/06/24 09:49 GTM
Human annotation variation (i.e., annotation disagreements) is common in NLP and often reflects important information such as task subjectivity and sample ambiguity. While Large Language Models (LLMs) are increasingly used for automatic annotation to reduce human effort, their evaluation often focuses on predicting the majority-voted “ground truth” labels. It is still unclear, however, whether these models also capture informative human annotation variation. Our work addresses this gap by extensively evaluating LLMs’ ability to predict annotation disagreements without access to repeated human labels. Our results show that LLMs struggle with modeling disagreements, which can be overlooked by majority label-based evaluations. Notably, while RLVR-style (reinforcement learning with verifiable rewards) reasoning generally boosts LLM performance, it degrades performance in disagreement prediction. Our findings highlight the critical need for evaluating and improving LLM annotators in disagreement modeling. Code and data at https://
3.96Image Segmentation using Chan-Vese Active Contours¶
2025/06/24 06:16 GTM
This paper presents a comprehensive derivation and implementation of the Chan-Vese active contour model for image segmentation. The model, derived from the Mumford-Shah variational framework, evolves contours based on regional intensity differences rather than image gradients, making it highly effective for segmenting noisy images or images with weak boundaries. We provide a rigorous mathematical derivation of the level set formulation, including detailed treatment of each energy term using the divergence theorem and curve evolution theory. The resulting algorithm is implemented in Python using finite difference methods with special care to numerical stability, including an upwind entropy scheme and curvature-based regularization. Experimental results on medical and synthetic images demonstrate accurate segmentation, robustness to noise, and superior performance compared to classical edge-based methods. This study confirms the suitability of the Chan-Vese model for complex segmentation tasks and highlights its potential for use in real-world imaging applications.
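For readers unfamiliar with the model, a compact and deliberately simplified explicit update of the Chan-Vese level-set evolution looks roughly as follows; the paper's implementation additionally uses an upwind entropy scheme and more careful regularization:

```python
import numpy as np

def chan_vese_step(phi, image, mu=0.2, lam1=1.0, lam2=1.0, dt=0.5, eps=1.0):
    """One explicit update of a simplified Chan-Vese level-set evolution:
    region means c1/c2 from the current contour, a smoothed Dirac delta, and a
    curvature term approximated with central differences."""
    inside, outside = phi > 0, phi <= 0
    c1 = image[inside].mean() if inside.any() else 0.0
    c2 = image[outside].mean() if outside.any() else 0.0

    # Curvature of the level set via central differences.
    gy, gx = np.gradient(phi)
    norm = np.sqrt(gx**2 + gy**2) + 1e-8
    curvature = np.gradient(gx / norm, axis=1) + np.gradient(gy / norm, axis=0)

    delta = eps / (np.pi * (eps**2 + phi**2))          # smoothed Dirac delta
    force = mu * curvature - lam1 * (image - c1)**2 + lam2 * (image - c2)**2
    return phi + dt * delta * force

# Synthetic noisy image with a bright square, and a circular initial contour.
img = np.zeros((64, 64)); img[20:44, 20:44] = 1.0
img += 0.2 * np.random.randn(64, 64)
yy, xx = np.mgrid[:64, :64]
phi = 25.0 - np.sqrt((xx - 32.0)**2 + (yy - 32.0)**2)
for _ in range(200):
    phi = chan_vese_step(phi, img)
segmentation = phi > 0
```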
3.97Memory-Augmented Incomplete Multimodal Survival Prediction via Cross-Slide and Gene-Attentive Hypergraph Learning¶
2025/06/24 05:31 GTM
Multimodal pathology-genomic analysis is critical for cancer survival prediction. However, existing approaches predominantly integrate formalin-fixed paraffin-embedded (FFPE) slides with genomic data, while neglecting the availability of other preservation slides, such as Fresh Frozen (FF) slides. Moreover, as the high-resolution spatial nature of pathology data tends to dominate the cross-modality fusion process, it hinders effective multimodal fusion and leads to modality imbalance challenges between pathology and genomics. These methods also typically require complete data modalities, limiting their clinical applicability with incomplete modalities, such as missing either pathology or genomic data. In this paper, we propose a multimodal survival prediction framework that leverages hypergraph learning to effectively integrate multi-WSI information and cross-modality interactions between pathology slides and genomics data while addressing modality imbalance. In addition, we introduce a memory mechanism that stores previously learned paired pathology-genomic features and dynamically compensates for incomplete modalities. Experiments on five TCGA datasets demonstrate that our model outperforms advanced methods by over 2.3% in C-Index. Under incomplete modality scenarios, our approach surpasses pathology-only (3.3%) and gene-only models (7.9%). Code: https://
3.98Experimental Assessment of Neural 3D Reconstruction for Small UAV-based Applications¶
2025/06/24 10:25 GTM
The increasing miniaturization of Unmanned Aerial Vehicles (UAVs) has expanded their deployment potential to indoor and hard-to-reach areas. However, this trend introduces distinct challenges, particularly in terms of flight dynamics and power consumption, which limit the UAVs’ autonomy and mission capabilities. This paper presents a novel approach to overcoming these limitations by integrating Neural 3D Reconstruction (N3DR) with small UAV systems for fine-grained 3-Dimensional (3D) digital reconstruction of small static objects. Specifically, we design, implement, and evaluate an N3DR-based pipeline that leverages advanced models, i.e., Instant-ngp, Nerfacto, and Splatfacto, to improve the quality of 3D reconstructions using images of the object captured by a fleet of small UAVs. We assess the performance of the considered models using various imagery and pointcloud metrics, comparing them against the baseline Structure from Motion (SfM) algorithm. The experimental results demonstrate that the N3DR-enhanced pipeline significantly improves reconstruction quality, making it feasible for small UAVs to support high-precision 3D mapping and anomaly detection in constrained environments. In more general terms, our results highlight the potential of N3DR in advancing the capabilities of miniaturized UAV systems.
3.99Surgery-R1: Advancing Surgical-VQLA with Reasoning Multimodal Large Language Model via Reinforcement Learning¶
2025/06/24 09:53 GTM
In recent years, significant progress has been made in the field of surgical scene understanding, particularly in the task of Visual Question Localized-Answering in robotic surgery (Surgical-VQLA). However, existing Surgical-VQLA models lack deep reasoning capabilities and interpretability in surgical scenes, which limits their reliability and potential for development in clinical applications. To address this issue, inspired by the development of Reasoning Multimodal Large Language Models (MLLMs), we first build the Surgery-R1-54k dataset, including paired data for Visual-QA, Grounding-QA, and Chain-of-Thought (CoT). Then, we propose the first Reasoning MLLM for Surgical-VQLA (Surgery-R1). In our Surgery-R1, we design a two-stage fine-tuning mechanism to equip the base MLLM with complex reasoning abilities by utilizing supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Furthermore, for an efficient and high-quality rule-based reward system in our RFT, we design a Multimodal Coherence reward mechanism to mitigate positional illusions that may arise in surgical scenarios. Experiment results demonstrate that Surgery-R1 outperforms other existing state-of-the-art (SOTA) models in the Surgical-VQLA task and widely-used MLLMs, while also validating its reasoning capabilities and the effectiveness of our approach. The code and dataset will be organized in https://
3.100Robotics Under Construction: Challenges on Job Sites¶
2025/06/24 13:07 GTM
As labor shortages and productivity stagnation increasingly challenge the construction industry, automation has become essential for sustainable infrastructure development. This paper presents an autonomous payload transportation system as an initial step toward fully unmanned construction sites. Our system, based on the CD110R-3 crawler carrier, integrates autonomous navigation, fleet management, and GNSS-based localization to facilitate material transport in construction site environments. While the current system does not yet incorporate dynamic environment adaptation algorithms, we have begun fundamental investigations into external-sensor based perception and mapping system. Preliminary results highlight the potential challenges, including navigation in evolving terrain, environmental perception under construction-specific conditions, and sensor placement optimization for improving autonomy and efficiency. Looking forward, we envision a construction ecosystem where collaborative autonomous agents dynamically adapt to site conditions, optimizing workflow and reducing human intervention. This paper provides foundational insights into the future of robotics-driven construction automation and identifies critical areas for further technological development.
3.101Automatic Posology Structuration : What role for LLMs?¶
2025/06/24 11:25 GTM
Automatically structuring posology instructions is essential for improving medication safety and enabling clinical decision support. In French prescriptions, these instructions are often ambiguous, irregular, or colloquial, limiting the effectiveness of classic ML pipelines. We explore the use of Large Language Models (LLMs) to convert free-text posologies into structured formats, comparing prompt-based methods and fine-tuning against a “pre-LLM” system based on Named Entity Recognition and Linking (NERL). Our results show that while prompting improves performance, only fine-tuned LLMs match the accuracy of the baseline. Through error analysis, we observe complementary strengths: NERL offers structural precision, while LLMs better handle semantic nuances. Based on this, we propose a hybrid pipeline that routes low-confidence cases from NERL (<0.8) to the LLM, selecting outputs based on confidence scores. This strategy achieves 91% structuration accuracy while minimizing latency and compute. Our results show that this hybrid approach improves structuration accuracy while limiting computational cost, offering a scalable solution for real-world clinical use.
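The routing rule itself is simple enough to sketch; the parser callables below are hypothetical stand-ins for the NERL baseline and the fine-tuned LLM:

```python
def structure_posology(text, nerl_parse, llm_parse, threshold=0.8):
    """Route a free-text posology through the NERL baseline first and fall back
    to the LLM only when the NERL confidence is low, as described above.
    `nerl_parse` and `llm_parse` are hypothetical callables returning a
    (structured_dict, confidence) pair."""
    structured, confidence = nerl_parse(text)
    if confidence >= threshold:
        return structured, "nerl"
    llm_structured, llm_confidence = llm_parse(text)
    # Keep whichever parse the respective system is more confident about.
    if llm_confidence > confidence:
        return llm_structured, "llm"
    return structured, "nerl"

# Example with stubbed parsers.
demo_nerl = lambda t: ({"dose": "1 comprime", "frequency": None}, 0.55)
demo_llm = lambda t: ({"dose": "1 comprime", "frequency": "3x/day"}, 0.90)
print(structure_posology("1 cp 3 fois par jour", demo_nerl, demo_llm))
```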
3.102Automated Detection of Pre-training Text in Black-box LLMs¶
2025/06/24 08:08 GTM
Detecting whether a given text is a member of the pre-training data of Large Language Models (LLMs) is crucial for ensuring data privacy and copyright protection. Most existing methods rely on the LLM’s hidden information (e.g., model parameters or token probabilities), making them ineffective in the black-box setting, where only input and output texts are accessible. Although some methods have been proposed for the black-box setting, they rely on massive manual efforts such as designing complicated questions or instructions. To address these issues, we propose VeilProbe, the first framework for automatically detecting LLMs’ pre-training texts in a black-box setting without human intervention. VeilProbe utilizes a sequence-to-sequence mapping model to infer the latent mapping feature between the input text and the corresponding output suffix generated by the LLM. Then it performs the key token perturbations to obtain more distinguishable membership features. Additionally, considering real-world scenarios where the ground-truth training text samples are limited, a prototype-based membership classifier is introduced to alleviate the overfitting issue. Extensive evaluations on three widely used datasets demonstrate that our framework is effective and superior in the black-box setting.
3.103Is an object-centric representation beneficial for robotic manipulation?¶
2025/06/24 08:23 GTM
Object-centric representation (OCR) has recently become a subject of interest in the computer vision community for learning a structured representation of images and videos. It has been presented several times as a potential way to improve data efficiency and generalization when learning an agent on downstream tasks. However, most existing work only evaluates such models on scene decomposition, without any notion of reasoning over the learned representation. Robotic manipulation tasks generally involve multi-object environments with potential inter-object interaction. We thus argue that they are a very interesting playground to really evaluate the potential of existing object-centric work. To do so, we create several robotic manipulation tasks in simulated environments involving multiple objects (several distractors, the robot, etc.) and a high level of randomization (object positions, colors, shapes, background, initial positions, etc.). We then evaluate one classical object-centric method across several generalization scenarios and compare its results against several state-of-the-art holistic representations. Our results show that existing methods are prone to failure in difficult scenarios involving complex scene structures, whereas object-centric methods help overcome these challenges.
3.104USIS16K: High-Quality Dataset for Underwater Salient Instance Segmentation¶
2025/06/24 09:58 GTM
Inspired by the biological visual system that selectively allocates attention to efficiently identify salient objects or regions, underwater salient instance segmentation (USIS) aims to jointly address the problems of where to look (saliency prediction) and what is there (instance segmentation) in underwater scenarios. However, USIS remains an underexplored challenge due to the inaccessibility and dynamic nature of underwater environments, as well as the scarcity of large-scale, high-quality annotated datasets. In this paper, we introduce USIS16K, a large-scale dataset comprising 16,151 high-resolution underwater images collected from diverse environmental settings and covering 158 categories of underwater objects. Each image is annotated with high-quality instance-level salient object masks, representing a significant advance in terms of diversity, complexity, and scalability. Furthermore, we provide benchmark evaluations on underwater object detection and USIS tasks using USIS16K. To facilitate future research in this domain, the dataset and benchmark models are publicly available.
3.105Evaluating Rare Disease Diagnostic Performance in Symptom Checkers: A Synthetic Vignette Simulation Approach¶
2025/06/24 16:06 GTM
Background: Symptom Checkers (SCs) provide users with personalized medical information. To prevent performance degradation from algorithm updates, SC developers must evaluate diagnostic performance changes for individual diseases before deployment. However, acquiring sufficient evaluation data for rare diseases is difficult, and manually creating numerous clinical vignettes is costly and impractical. Objective: This study proposes and validates a novel Synthetic Vignette Simulation Approach to evaluate diagnostic performance changes for individual rare diseases following SC algorithm updates. Methods: We used disease-phenotype annotations from the Human Phenotype Ontology (HPO), a knowledge database for rare diseases, to generate synthetic vignettes. With these, we simulated SC interviews to estimate the impact of algorithm updates on real-world diagnostic performance. The method’s effectiveness was evaluated retrospectively by comparing estimated values with actual metric changes using the R^2 (R-squared) coefficient. Results: The experiment included eight past SC algorithm updates. For updates on diseases with frequency information in HPO (n=5), the R^2 for recall@8 change was 0.831 (p=0.031), and for precision@8 change, it was 0.78 (p=0.047), indicating the method can predict post-deployment performance. In contrast, large prediction errors occurred for diseases without frequency information (n=3), highlighting its importance. The manual effort to map HPO phenotypes to SC symptoms was approximately 2 hours per disease. Conclusions: Our method enables pre-deployment evaluation of SC algorithm changes for individual rare diseases using a publicly available, expert-created knowledge base. This transparent and low-cost approach allows developers to efficiently improve diagnostic performance for rare diseases, potentially enhancing support for early diagnosis.
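The retrospective validation reduces to comparing estimated and observed metric changes per update, e.g. with scikit-learn; the numbers below are made up and not the study's data:

```python
from sklearn.metrics import r2_score

# Hypothetical estimated vs. observed changes in recall@8 for five algorithm
# updates (the study reports R^2 = 0.831 for this setting).
estimated_delta = [0.04, -0.02, 0.07, 0.01, -0.05]
observed_delta = [0.05, -0.01, 0.06, 0.02, -0.06]
print(r2_score(observed_delta, estimated_delta))
```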
3.106General Methods Make Great Domain-specific Foundation Models: A Case-study on Fetal Ultrasound¶
2025/06/24 12:00 GTM
With access to large-scale, unlabeled medical datasets, researchers are confronted with two questions: Should they attempt to pretrain a custom foundation model on this medical data, or use transfer-learning from an existing generalist model? And, if a custom model is pretrained, are novel methods required? In this paper we explore these questions by conducting a case-study, in which we train a foundation model on a large regional fetal ultrasound dataset of 2M images. By selecting the well-established DINOv2 method for pretraining, we achieve state-of-the-art results on three fetal ultrasound datasets, covering data from different countries, classification, segmentation, and few-shot tasks. We compare against a series of models pretrained on natural images, ultrasound images, and supervised baselines. Our results demonstrate two key insights: (i) Pretraining on custom data is worth it, even if smaller models are trained on less data, as scaling in natural image pretraining does not translate to ultrasound performance. (ii) Well-tuned methods from computer vision are making it feasible to train custom foundation models for a given medical domain, requiring no hyperparameter tuning and little methodological adaptation. Given these findings, we argue that a bias towards methodological innovation should be avoided when developing domain-specific foundation models under common computational resource constraints.
3.107Segment Any 3D-Part in a Scene from a Sentence¶
2025/06/24 05:51 GTM
This paper aims to achieve the segmentation of any 3D part in a scene based on natural language descriptions, extending beyond traditional object-level 3D scene understanding and addressing both data and methodological challenges. Due to the expensive acquisition and annotation burden, existing datasets and methods are predominantly limited to object-level comprehension. To overcome the limitations of data and annotation availability, we introduce the 3D-PU dataset, the first large-scale 3D dataset with dense part annotations, created through an innovative and cost-effective method for constructing synthetic 3D scenes with fine-grained part-level annotations, paving the way for advanced 3D-part scene understanding. On the methodological side, we propose OpenPart3D, a 3D-input-only framework to effectively tackle the challenges of part-level segmentation. Extensive experiments demonstrate the superiority of our approach in open-vocabulary 3D scene understanding tasks at the part level, with strong generalization capabilities across various 3D scene datasets.
3.108Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study¶
2025/06/24 17:04 GTM
Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate models across three dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs’ analytical reasoning capabilities.
3.109Reconsidering Explicit Longitudinal Mammography Alignment for Enhanced Breast Cancer Risk Prediction¶
2025/06/24 06:46 GTM
Regular mammography screening is essential for early breast cancer detection. Deep learning-based risk prediction methods have sparked interest in adjusting screening intervals for high-risk groups. While early methods focused only on current mammograms, recent approaches leverage the temporal aspect of screenings to track breast tissue changes over time, requiring spatial alignment across different time points. Two main strategies for this have emerged: explicit feature alignment through deformable registration and implicit learned alignment using techniques like transformers, with the former providing more control. However, the optimal approach for explicit alignment in mammography remains underexplored. In this study, we provide insights into where explicit alignment should occur (input space vs. representation space) and whether alignment and risk prediction should be jointly optimized. We demonstrate that jointly learning explicit alignment in representation space while optimizing risk estimation performance, as done in the current state-of-the-art approach, results in a trade-off between alignment quality and predictive performance, and show that image-level alignment is superior to representation-level alignment, leading to better deformation field quality and enhanced risk prediction accuracy. The code is available at https://
3.110NAADA: A Noise-Aware Attention Denoising Autoencoder for Dental Panoramic Radiographs¶
2025/06/24 07:23 GTM
Convolutional denoising autoencoders (DAEs) are powerful tools for image restoration. However, they inherit a key limitation of convolutional neural networks (CNNs): they tend to recover low-frequency features, such as smooth regions, more effectively than high-frequency details. This leads to the loss of fine details, which is particularly problematic in dental radiographs where preserving subtle anatomical structures is crucial. While self-attention mechanisms can help mitigate this issue by emphasizing important features, conventional attention methods often prioritize features corresponding to cleaner regions and may overlook those obscured by noise. To address this limitation, we propose a noise-aware self-attention method, which allows the model to effectively focus on and recover key features even within noisy regions. Building on this approach, we introduce the noise-aware attention-enhanced denoising autoencoder (NAADA) network for enhancing noisy panoramic dental radiographs. Compared with recent, much heavier state-of-the-art methods such as Uformer and MResDNN, our method improves the reconstruction of fine details, ensuring better image quality and diagnostic accuracy.
3.111AMF-MedIT: An Efficient Align-Modulation-Fusion Framework for Medical Image-Tabular Data¶
2025/06/24 09:10 GTM
Multimodal medical analysis combining image and tabular data has gained increasing attention. However, effective fusion remains challenging due to cross-modal discrepancies in feature dimensions and modality contributions, as well as the noise from high-dimensional tabular inputs. To address these problems, we present AMF-MedIT, an efficient Align-Modulation-Fusion framework for medical image and tabular data integration, particularly under data-scarce conditions. To harmonize dimension discrepancies and dynamically adjust modality contributions, we propose the Adaptive Modulation and Fusion (AMF) module, a novel modulation-based fusion paradigm with a streamlined architecture. We first derive the modulation objectives and introduce a modality confidence ratio, enabling the incorporation of prior knowledge into the fusion process. Then, the feature masks, density and leakage losses are proposed to achieve the modulation objectives. Additionally, we introduce FT-Mamba, a powerful tabular encoder leveraging a selective mechanism to handle noisy medical tabular data efficiently. Furthermore, interpretability studies are conducted to explore how different tabular encoders supervise the imaging modality during contrastive pretraining for the first time. Extensive experiments demonstrate that AMF-MedIT achieves a superior balance between multimodal performance and data efficiency while showing strong adaptability to incomplete tabular data. Interpretability analysis also highlights FT-Mamba’s capabilities in extracting distinct tabular features and guiding the image encoder toward more accurate and flexible attention patterns.
3.112UltraAD: Fine-Grained Ultrasound Anomaly Classification via Few-Shot CLIP Adaptation¶
2025/06/24 15:00 GTM
Precise anomaly detection in medical images is critical for clinical decision-making. While recent unsupervised or semi-supervised anomaly detection methods trained on large-scale normal data show promising results, they lack fine-grained differentiation, such as benign vs. malignant tumors. Additionally, ultrasound (US) imaging is highly sensitive to devices and acquisition parameter variations, creating significant domain gaps in the resulting US images. To address these challenges, we propose UltraAD, a vision-language model (VLM)-based approach that leverages few-shot US examples for generalized anomaly localization and fine-grained classification. To enhance localization performance, the image-level token of query visual prototypes is first fused with learnable text embeddings. This image-informed prompt feature is then further integrated with patch-level tokens, refining local representations for improved accuracy. For fine-grained classification, a memory bank is constructed from few-shot image samples and corresponding text descriptions that capture anatomical and abnormality-specific features. During training, the stored text embeddings remain frozen, while image features are adapted to better align with medical data. UltraAD has been extensively evaluated on three breast US datasets, outperforming state-of-the-art methods in both lesion localization and fine-grained medical classification. The code will be released upon acceptance.
3.113Identifying Physically Realizable Triggers for Backdoored Face Recognition Networks¶
2025/06/24 11:36 GTM
Backdoor attacks embed a hidden functionality into deep neural networks, causing the network to display anomalous behavior when activated by a predetermined pattern (the trigger) in the input, while behaving well otherwise on public test data. Recent works have shown that backdoored face recognition (FR) systems can respond to natural-looking triggers like a particular pair of sunglasses. Such attacks pose a serious threat to the applicability of FR systems in high-security applications. We propose a novel technique to (1) detect whether an FR network is compromised with a natural, physically realizable trigger, and (2) identify such triggers given a compromised network. We demonstrate the effectiveness of our methods with a compromised FR network, where we are able to identify the trigger (e.g., green sunglasses or red hat) with a top-5 accuracy of 74%, whereas a naive brute force baseline achieves 56% accuracy.
3.114The Starlink Robot: A Platform and Dataset for Mobile Satellite Communication¶
2025/06/24 16:49 GTM
The integration of satellite communication into mobile devices represents a paradigm shift in connectivity, yet the performance characteristics under motion and environmental occlusion remain poorly understood. We present the Starlink Robot, the first mobile robotic platform equipped with Starlink satellite internet and a comprehensive sensor suite including an upward-facing camera, LiDAR, and IMU, designed to systematically study satellite communication performance during movement. Our multi-modal dataset captures synchronized communication metrics, motion dynamics, sky visibility, and 3D environmental context across diverse scenarios including steady-state motion, variable speeds, and different occlusion conditions. This platform and dataset enable researchers to develop motion-aware communication protocols, predict connectivity disruptions, and optimize satellite communication for emerging mobile applications from smartphones to autonomous vehicles. The project is available at https://
3.115Recurrent Visual Feature Extraction and Stereo Attentions for CT Report Generation¶
2025/06/24 14:29 GTM
Generating reports for computed tomography (CT) images is a challenging task that, while similar to medical image report generation in existing studies, has its own unique characteristics, such as spatial encoding of multiple images, alignment between image volume and texts, etc. Existing solutions typically use general 2D or 3D image processing techniques to extract features from a CT volume, where they first compress the volume and then divide the compressed CT slices into patches for visual encoding. These approaches do not explicitly account for the transformations among CT slices, nor do they effectively integrate multi-level image features, particularly those containing specific organ lesions, to instruct CT report generation (CTRG). Considering the strong correlation among consecutive slices in CT scans, in this paper we propose a large language model (LLM) based CTRG method with recurrent visual feature extraction and stereo attentions for hierarchical feature modeling. Specifically, we use a vision Transformer to recurrently process each slice in a CT volume, and employ a set of attentions over the encoded slices from different perspectives to selectively obtain important visual information and align them with textual features, so as to better instruct an LLM for CTRG. Experiment results and further analysis on the benchmark M3D-Cap dataset show that our method outperforms strong baseline models and achieves state-of-the-art results, demonstrating its validity and effectiveness.
3.117How Effectively Can BERT Models Interpret Context and Detect Bengali Communal Violent Text?¶
2025/06/24 17:48 GTM
The spread of cyber hatred has led to communal violence, fueling aggression and conflicts between various religious, ethnic, and social groups, posing a significant threat to social harmony. Despite its critical importance, the classification of communal violent text remains an underexplored area in existing research. This study aims to enhance the accuracy of detecting text that incites communal violence, focusing specifically on Bengali textual data sourced from social media platforms. We introduce a fine-tuned BanglaBERT model tailored for this task, achieving a macro F1 score of 0.60. To address the issue of data imbalance, our dataset was expanded by adding 1,794 instances, which facilitated the development and evaluation of a fine-tuned ensemble model. This ensemble model demonstrated an improved performance, achieving a macro F1 score of 0.63, thus highlighting its effectiveness in this domain. In addition to quantitative performance metrics, qualitative analysis revealed instances where the models struggled with context understanding, leading to occasional misclassifications, even when predictions were made with high confidence. Through analyzing the cosine similarity between words, we identified certain limitations in the pre-trained BanglaBERT models, particularly in their ability to distinguish between closely related communal and non-communal terms. To further interpret the model’s decisions, we applied LIME, which helped to uncover specific areas where the model struggled in understanding context, contributing to errors in classification. These findings highlight the promise of NLP and interpretability tools in reducing online communal violence. Our work contributes to the growing body of research in communal violence detection and offers a foundation for future studies aiming to refine these techniques for better accuracy and societal impact.
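The cosine-similarity probe mentioned above can be reproduced in a few lines; the checkpoint ID and the two Bengali placeholder terms below are assumptions rather than details taken from the paper:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Checkpoint ID assumed here; substitute the BanglaBERT variant actually used.
name = "csebuetnlp/banglabert"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def word_embedding(word):
    """Mean-pooled last-hidden-state embedding for a single word."""
    inputs = tok(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # [1, seq_len, dim]
    return hidden.mean(dim=1).squeeze(0)

def cosine(a, b):
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

# Compare a communal and a non-communal term (placeholders) to probe whether
# the encoder separates them, as done in the qualitative analysis above.
print(cosine(word_embedding("শব্দ১"), word_embedding("শব্দ২")))
```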
3.118HMSViT: A Hierarchical Masked Self-Supervised Vision Transformer for Corneal Nerve Segmentation and Diabetic Neuropathy Diagnosis¶
2025/06/24 10:00 GTM
Diabetic Peripheral Neuropathy (DPN) affects nearly half of diabetes patients, requiring early detection. Corneal Confocal Microscopy (CCM) enables non-invasive diagnosis, but automated methods suffer from inefficient feature extraction, reliance on handcrafted priors, and data limitations. We propose HMSViT, a novel Hierarchical Masked Self-Supervised Vision Transformer designed for corneal nerve segmentation and DPN diagnosis. Unlike existing methods, HMSViT employs pooling-based hierarchical and dual attention mechanisms with absolute positional encoding, enabling efficient multi-scale feature extraction by capturing fine-grained local details in early layers and integrating global context in deeper layers, all at a lower computational cost. A block-masked self-supervised learning framework is designed for HMSViT to reduce reliance on labelled data and enhance feature robustness, while a multi-scale decoder is used for segmentation and classification by fusing hierarchical features. Experiments on clinical CCM datasets showed that HMSViT achieves state-of-the-art performance, with 61.34% mIoU for nerve segmentation and 70.40% diagnostic accuracy, outperforming leading hierarchical models like the Swin Transformer and HiViT by margins of up to 6.39% in segmentation accuracy while using fewer parameters. Detailed ablation studies further reveal that integrating block-masked SSL with hierarchical multi-scale feature extraction substantially enhances performance compared to conventional supervised training. Overall, these comprehensive experiments confirm that HMSViT delivers excellent, robust, and clinically viable results, demonstrating its potential for scalable deployment in real-world diagnostic applications.
3.119SMARTIES: Spectrum-Aware Multi-Sensor Auto-Encoder for Remote Sensing Images¶
2025/06/24 12:51 GTM
From optical sensors to microwave radars, leveraging the complementary strengths of remote sensing (RS) sensors is crucial for achieving dense spatio-temporal monitoring of our planet. In contrast, recent deep learning models, whether task-specific or foundational, are often specific to single sensors or to fixed combinations: adapting such models to different sensory inputs requires both architectural changes and re-training, limiting scalability and generalization across multiple RS sensors. On the contrary, a single model able to modulate its feature representations to accept diverse sensors as input would pave the way to agile and flexible multi-sensor RS data processing. To address this, we introduce SMARTIES, a generic and versatile foundation model lifting sensor-specific/dependent efforts and enabling scalability and generalization to diverse RS sensors: SMARTIES projects data from heterogeneous sensors into a shared spectrum-aware space, enabling the use of arbitrary combinations of bands both for training and inference. To obtain sensor-agnostic representations, we train a single, unified transformer model reconstructing masked multi-sensor data with cross-sensor token mixup. On both single- and multi-modal tasks across diverse sensors, SMARTIES outperforms previous models that rely on sensor-specific pretraining. Our code and pretrained models are available at https://
3.120Genome-Anchored Foundation Model Embeddings Improve Molecular Prediction from Histology Images¶
2025/06/24 14:48 GTM
Precision oncology requires accurate molecular insights, yet obtaining these directly from genomics is costly and time-consuming for broad clinical use. Predicting complex molecular features and patient prognosis directly from routine whole-slide images (WSI) remains a major challenge for current deep learning methods. Here we introduce PathLUPI, which uses transcriptomic privileged information during training to extract genome-anchored histological embeddings, enabling effective molecular prediction using only WSIs at inference. Through extensive evaluation across 49 molecular oncology tasks using 11,257 cases among 20 cohorts, PathLUPI demonstrated superior performance compared to conventional methods trained solely on WSIs. Crucially, it achieves AUC 0.80 in 14 of the biomarker prediction and molecular subtyping tasks and C-index 0.70 in survival cohorts of 5 major cancer types. Moreover, PathLUPI embeddings reveal distinct cellular morphological signatures associated with specific genotypes and related biological pathways within WSIs. By effectively encoding molecular context to refine WSI representations, PathLUPI overcomes a key limitation of existing models and offers a novel strategy to bridge molecular insights with routine pathology workflows for wider clinical application.
3.121Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis¶
2025/06/24 16:06 GTM
The Arabic language is among the most popular languages in the world with a huge variety of dialects spoken in 22 countries. In this study, we address the problem of classifying 18 Arabic dialects of the QADI dataset of Arabic tweets. RNN models, Transformer models, and large language models (LLMs) via prompt engineering are created and tested. Among these, MARBERTv2 performed best with 65% accuracy and 64% F1-score. Through the use of state-of-the-art preprocessing techniques and the latest NLP models, this paper identifies the most significant linguistic issues in Arabic dialect identification. The results corroborate applications like personalized chatbots that respond in users’ dialects, social media monitoring, and greater accessibility for Arabic communities.
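Fine-tuning a BERT-style Arabic model for 18-way dialect classification can be sketched with the Hugging Face API as below; the checkpoint id, hyperparameters, and the toy two-tweet dataset standing in for QADI are assumptions, not details from the paper:

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

model_name = "UBC-NLP/MARBERTv2"  # assumed Hugging Face checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=18)

# Toy stand-in for the QADI tweets; the real setup trains on the 18-dialect dataset.
raw = Dataset.from_dict({"text": ["مثال تغريدة", "تغريدة أخرى"], "label": [0, 1]})
ds = raw.map(lambda b: tokenizer(b["text"], truncation=True,
                                 padding="max_length", max_length=64), batched=True)

args = TrainingArguments(output_dir="marbert-qadi", per_device_train_batch_size=16,
                         num_train_epochs=3, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=ds).train()
```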
3.122Probabilistic modelling and safety assurance of an agriculture robot providing light-treatment¶
2025/06/24 13:39 GTM
Continued adoption of agricultural robots postulates the farmer’s trust in the reliability, robustness and safety of the new technology. This motivates our work on safety assurance of agricultural robots, particularly their ability to detect, track and avoid obstacles and humans. This paper considers a probabilistic modelling and risk analysis framework for use in the early development phases. Starting off with hazard identification and a risk assessment matrix, the behaviour of the mobile robot platform, sensor and perception system, and any humans present are captured using three state machines. An auto-generated probabilistic model is then solved and analysed using the probabilistic model checker PRISM. The result provides unique insight into fundamental development and engineering aspects by quantifying the effect of the risk mitigation actions and risk reduction associated with distinct design concepts. These include implications of adopting a higher performance and more expensive Object Detection System or opting for a more elaborate warning system to increase human awareness. Although this paper mainly focuses on the initial concept-development phase, the proposed safety assurance framework can also be used during implementation, and subsequent deployment and operation phases.
3.123Comparative Performance of Finetuned ImageNet Pre-trained Models for Electronic Component Classification¶
2025/06/24 05:48 GTM
Electronic component classification and detection are crucial in manufacturing industries, significantly reducing labor costs and promoting technological and industrial development. Pre-trained models, especially those trained on ImageNet, are highly effective in image classification, allowing researchers to achieve excellent results even with limited data. This paper compares the performance of twelve ImageNet pre-trained models in classifying electronic components. Our findings show that all models tested delivered respectable accuracies. MobileNet-V2 recorded the highest at 99.95%, while EfficientNet-B0 had the lowest at 92.26%. These results underscore the substantial benefits of using ImageNet pre-trained models in image classification tasks and confirm the practical applicability of these methods in the electronics manufacturing sector.
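A typical transfer-learning setup of the kind compared here can be sketched as follows (the class count and the freeze-the-backbone choice are assumptions): load ImageNet weights and replace only the classifier head.

```python
import torch.nn as nn
from torchvision import models

num_classes = 12  # assumed number of electronic-component categories

# Load ImageNet-pretrained MobileNet-V2 and swap the final classifier layer.
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, num_classes)

# Optionally freeze the convolutional backbone and fine-tune only the new head first.
for p in model.features.parameters():
    p.requires_grad = False
```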
3.124Angio-Diff: Learning a Self-Supervised Adversarial Diffusion Model for Angiographic Geometry Generation¶
2025/06/24 09:31 GTM
Vascular diseases pose a significant threat to human health, with X-ray
angiography established as the gold standard for diagnosis, allowing for
detailed observation of blood vessels. However, angiographic X-rays expose
personnel and patients to higher radiation levels than non-angiographic X-rays,
which is undesirable. Thus, modality translation from non-angiographic to
angiographic X-rays is desirable. Data-driven deep approaches are hindered by
the lack of paired large-scale X-ray angiography datasets, which makes
high-quality vascular angiography synthesis crucial yet challenging. We
find that current medical image synthesis primarily operates at pixel level and
struggles to adapt to the complex geometric structure of blood vessels,
resulting in unsatisfactory quality of blood vessel image synthesis, such as
disconnections or unnatural curvatures. To overcome this issue, we propose a
self-supervised method via diffusion models to transform non-angiographic
X-rays into angiographic X-rays, mitigating data shortages for data-driven
approaches. Our model comprises a diffusion model that learns the distribution
of vascular data from diffusion latent, a generator for vessel synthesis, and a
mask-based adversarial module. To enhance geometric accuracy, we propose a
parametric vascular model to fit the shape and distribution of blood vessels.
The proposed method contributes a pipeline and a synthetic dataset for X-ray
angiography. We conducted extensive comparative and ablation experiments to
evaluate the Angio-Diff. The results demonstrate that our method achieves
state-of-the-art performance in synthetic angiography image quality and more
accurately synthesizes the geometric structure of blood vessels. The code is
available at https://
3.125heiDS at ArchEHR-QA 2025: From Fixed-k to Query-dependent-k for Retrieval Augmented Generation¶
2025/06/24 11:03 GTM
This paper presents the approach of our team called heiDS for the ArchEHR-QA 2025 shared task. A pipeline using a retrieval augmented generation (RAG) framework is designed to generate answers that are attributed to clinical evidence from the electronic health records (EHRs) of patients in response to patient-specific questions. We explored various components of a RAG framework, focusing on ranked list truncation (RLT) retrieval strategies and attribution approaches. Instead of using a fixed top-k RLT retrieval strategy, we employ a query-dependent-k retrieval strategy, including the existing surprise and autocut methods and two new methods proposed in this work, autocut* and elbow. The experimental results show the benefits of our strategy in producing factual and relevant answers when compared to a fixed-k strategy.
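The query-dependent cutoff idea can be illustrated with a simple elbow rule over the retrieval scores (a sketch of the general idea only; the exact autocut*/elbow definitions in the paper may differ): keep documents up to the largest drop in score.

```python
def elbow_cutoff(scores, min_k=1, max_k=None):
    """Truncate a descending score list at the largest score drop (illustrative elbow rule)."""
    max_k = max_k or len(scores)
    gaps = [scores[i] - scores[i + 1] for i in range(min_k - 1, min(max_k, len(scores)) - 1)]
    if not gaps:
        return min(min_k, len(scores))
    return min_k + max(range(len(gaps)), key=gaps.__getitem__)

scores = [0.92, 0.90, 0.88, 0.55, 0.52, 0.50]   # hypothetical retrieval scores
print(elbow_cutoff(scores))  # -> 3: cut before the large 0.88 -> 0.55 drop
```

Unlike a fixed top-k, the cutoff adapts to each query's score profile, which is the motivation behind query-dependent-k strategies.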
3.126Semantic Scene Graph for Ultrasound Image Explanation and Scanning Guidance¶
2025/06/24 14:49 GTM
Understanding medical ultrasound imaging remains a long-standing challenge due to significant visual variability caused by differences in imaging and acquisition parameters. Recent advancements in large language models (LLMs) have been used to automatically generate terminology-rich summaries oriented to clinicians with sufficient physiological knowledge. Nevertheless, the increasing demand for improved ultrasound interpretability and basic scanning guidance among non-expert users, e.g., in point-of-care settings, has not yet been explored. In this study, we first introduce the scene graph (SG) for ultrasound images to explain image content to ordinary users and provide guidance for ultrasound scanning. The ultrasound SG is first computed using a transformer-based one-stage method, eliminating the need for explicit object detection. To generate an image explanation graspable by ordinary users, the user query is then used to further refine the abstract SG representation through LLMs. Additionally, the predicted SG is explored for its potential in guiding ultrasound scanning toward missing anatomies within the current imaging view, assisting ordinary users in achieving more standardized and complete anatomical exploration. The effectiveness of this SG-based image explanation and scanning guidance has been validated on images from the left and right neck regions, including the carotid and thyroid, across five volunteers. The results demonstrate the potential of the method to maximally democratize ultrasound by enhancing its interpretability and usability for ordinary users.
3.127ReMAR-DS: Recalibrated Feature Learning for Metal Artifact Reduction and CT Domain Transformation¶
2025/06/24 11:34 GTM
Artifacts in kilo-Voltage CT (kVCT) imaging degrade image quality, impacting clinical decisions. We propose a deep learning framework for metal artifact reduction (MAR) and domain transformation from kVCT to Mega-Voltage CT (MVCT). The proposed framework, ReMAR-DS, utilizes an encoder-decoder architecture with enhanced feature recalibration, effectively reducing artifacts while preserving anatomical structures. This ensures that only relevant information is utilized in the reconstruction process. By infusing recalibrated features from the encoder block, the model focuses on relevant spatial regions (e.g., areas with artifacts) and highlights key features across channels (e.g., anatomical structures), leading to improved reconstruction of artifact-corrupted regions. Unlike traditional MAR methods, our approach bridges the gap between high-resolution kVCT and artifact-resistant MVCT, enhancing radiotherapy planning. It produces high-quality MVCT-like reconstructions, validated through qualitative and quantitative evaluations. Clinically, this enables oncologists to rely on kVCT alone, reducing repeated high-dose MVCT scans and lowering radiation exposure for cancer patients.
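Feature recalibration of the kind described is often implemented as a squeeze-and-excitation style channel gate; the snippet below is a generic sketch of such a block, not the exact ReMAR-DS module:

```python
import torch
import torch.nn as nn

class ChannelRecalibration(nn.Module):
    """Squeeze-and-excitation style channel gate (generic sketch)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                      # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))        # global average pool -> per-channel weights
        return x * w[:, :, None, None]         # re-weight encoder features channel-wise
```

Gating encoder features this way lets the decoder emphasize channels carrying anatomy while down-weighting artifact-dominated ones.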
3.128SAM2-SGP: Enhancing SAM2 for Medical Image Segmentation via Support-Set Guided Prompting¶
2025/06/24 14:22 GTM
Although new vision foundation models such as Segment Anything Model 2 (SAM2)
have significantly enhanced zero-shot image segmentation capabilities, reliance
on human-provided prompts poses significant challenges in adapting SAM2 to
medical image segmentation tasks. Moreover, SAM2’s performance in medical image
segmentation was limited by the domain shift issue, since it was originally
trained on natural images and videos. To address these challenges, we proposed
SAM2 with support-set guided prompting (SAM2-SGP), a framework that eliminated
the need for manual prompts. The proposed model leveraged the memory mechanism
of SAM2 to generate pseudo-masks using image-mask pairs from a support set via
a Pseudo-mask Generation (PMG) module. We further introduced a novel
Pseudo-mask Attention (PMA) module, which used these pseudo-masks to
automatically generate bounding boxes and enhance localized feature extraction
by guiding attention to relevant areas. Furthermore, a low-rank adaptation
(LoRA) strategy was adopted to mitigate the domain shift issue. The proposed
framework was evaluated on both 2D and 3D datasets across multiple medical
imaging modalities, including fundus photography, X-ray, computed tomography
(CT), magnetic resonance imaging (MRI), positron emission tomography (PET), and
ultrasound. The results demonstrated a significant performance improvement over
state-of-the-art models, such as nnUNet and SwinUNet, as well as foundation
models, such as SAM2 and MedSAM2, underscoring the effectiveness of the
proposed approach. Our code is publicly available at
https://
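The pseudo-mask-to-prompt step can be pictured as below (an illustrative helper, not the released PMA module; the threshold and padding are assumptions): derive a box from the support-driven pseudo-mask and hand it to the segmenter as a spatial prompt.

```python
import numpy as np

def mask_to_box(pseudo_mask, threshold=0.5, pad=4):
    """Turn a soft pseudo-mask (H, W) into an xyxy bounding-box prompt."""
    ys, xs = np.where(pseudo_mask > threshold)
    if len(xs) == 0:
        return None                      # nothing confident enough to prompt with
    h, w = pseudo_mask.shape
    x0, x1 = max(xs.min() - pad, 0), min(xs.max() + pad, w - 1)
    y0, y1 = max(ys.min() - pad, 0), min(ys.max() + pad, h - 1)
    return np.array([x0, y0, x1, y1])
```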
3.129Health Sentinel: An AI Pipeline For Real-time Disease Outbreak Detection¶
2025/06/24 11:54 GTM
Early detection of disease outbreaks is crucial to ensure timely intervention by the health authorities. Due to the challenges associated with traditional indicator-based surveillance, monitoring informal sources such as online media has become increasingly popular. However, owing to the number of online articles getting published every day, manual screening of the articles is impractical. To address this, we propose Health Sentinel, a multi-stage information extraction pipeline that uses a combination of ML and non-ML methods to extract events, i.e., structured information concerning disease outbreaks or other unusual health events, from online articles. The extracted events are made available to the Media Scanning and Verification Cell (MSVC) at the National Centre for Disease Control (NCDC), Delhi for analysis, interpretation and further dissemination to local agencies for timely intervention. From April 2022 to date, Health Sentinel has processed over 300 million news articles and identified over 95,000 unique health events across India, of which over 3,500 events were shortlisted by the public health experts at NCDC as potential outbreaks.
3.130Soft Robotic Delivery of Coiled Anchors for Cardiac Interventions¶
2025/06/24 13:13 GTM
Trans-catheter cardiac intervention has become an increasingly available option for high-risk patients without the complications of open heart surgery. However, current catheter-based platforms suffer from a lack of dexterity, force application, and compliance required to perform complex intracardiac procedures. An exemplary task that would significantly ease minimally invasive intracardiac procedures is the implantation of anchor coils, which can be used to fix and implant various devices in the beating heart. We introduce a robotic platform capable of delivering anchor coils. We develop a kineto-statics model of the robotic platform and demonstrate low positional error. We leverage the passive compliance and high force output of the actuator in a multi-anchor delivery procedure against a motile in-vitro simulator with millimeter level accuracy.
3.131Filling of incomplete sinograms from sparse PET detector configurations using a residual U-Net¶
2025/06/24 13:10 GTM
Long axial field-of-view PET scanners offer increased field-of-view and sensitivity compared to traditional PET scanners. However, a significant cost is associated with the densely packed photodetectors required for the extended-coverage systems, limiting clinical utilisation. To mitigate the cost limitations, alternative sparse system configurations have been proposed, allowing an extended field-of-view PET design with detector costs similar to a standard PET system, albeit at the expense of image quality. In this work, we propose a deep sinogram restoration network to fill in the missing sinogram data. Our method utilises a modified Residual U-Net, trained on clinical PET scans from a GE Signa PET/MR, simulating the removal of 50% of the detectors in a chessboard pattern (retaining only 25% of all lines of response). The model successfully recovers missing counts, with a mean absolute error below two events per pixel, outperforming 2D interpolation in both sinogram and reconstructed image domain. Notably, the predicted sinograms exhibit a smoothing effect, leading to reconstructed images lacking sharpness in finer details. Despite these limitations, the model demonstrates a substantial capacity for compensating for the undersampling caused by the sparse detector configuration. This proof-of-concept study suggests that sparse detector configurations, combined with deep learning techniques, offer a viable alternative to conventional PET scanner designs. This approach supports the development of cost-effective, total body PET scanners, allowing a significant step forward in medical imaging technology.
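The 50%-removed chessboard configuration can be simulated on a sinogram roughly as below (a sketch under the simplifying assumption that alternating detector blocks map to alternating tiles in view/radial-bin space; the tile size and toy counts are assumptions). A residual U-Net would then be trained to recover the zeroed entries.

```python
import numpy as np

def chessboard_mask(n_views, n_bins, tile=16):
    """Illustrative chessboard pattern over sinogram (view, radial-bin) space."""
    v = (np.arange(n_views) // tile) % 2
    b = (np.arange(n_bins) // tile) % 2
    return (v[:, None] + b[None, :]) % 2 == 0   # True = measured, False = missing

sino = np.random.poisson(5.0, size=(180, 256)).astype(np.float32)  # toy sinogram
mask = chessboard_mask(*sino.shape)
incomplete = sino * mask       # zero out the "missing detector" entries
# The restoration network takes `incomplete` (and optionally `mask`) and predicts `sino`.
```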
3.132MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration¶
2025/06/24 17:52 GTM
Recent advancements in medical Large Language Models (LLMs) have showcased
their powerful reasoning and diagnostic capabilities. Despite their success,
current unified multimodal medical LLMs face limitations in knowledge update
costs, comprehensiveness, and flexibility. To address these challenges, we
introduce the Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis
(MAM). Inspired by our empirical findings highlighting the benefits of role
assignment and diagnostic discernment in LLMs, MAM decomposes the medical
diagnostic process into specialized roles: a General Practitioner, Specialist
Team, Radiologist, Medical Assistant, and Director, each embodied by an
LLM-based agent. This modular and collaborative framework enables efficient
knowledge updates and leverages existing medical LLMs and knowledge bases.
Extensive experimental evaluations conducted on a wide range of publicly
accessible multimodal medical datasets, incorporating text, image, audio, and
video modalities, demonstrate that MAM consistently surpasses the performance
of modality-specific LLMs. Notably, MAM achieves significant performance
improvements ranging from 18% to 365% compared to baseline models. Our code is
released at https://
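The role-specialised flow might be orchestrated along the following lines (a toy sketch only; `call_llm` is a hypothetical stand-in for whichever medical LLM backs each role, and the roles and ordering mirror the paper only loosely):

```python
ROLES = ["General Practitioner", "Radiologist", "Specialist Team", "Medical Assistant"]

def call_llm(role, prompt):
    """Hypothetical stand-in for the LLM serving a given role."""
    return f"[{role}] assessment based on the prompt ({len(prompt)} chars)"

def diagnose(case_description):
    notes = []
    for role in ROLES:
        prompt = (f"As the {role}, assess the case below.\n"
                  f"Case: {case_description}\n"
                  f"Prior findings:\n" + "\n".join(notes))
        notes.append(call_llm(role, prompt))
    # The Director aggregates every role-specific finding into a final diagnosis.
    return call_llm("Director", "Summarize a final diagnosis from:\n" + "\n".join(notes))

print(diagnose("65-year-old patient; chest X-ray and cough audio attached."))
```

Because each role is a separate agent, swapping in a newer specialist model or knowledge base does not require retraining a monolithic multimodal LLM, which is the modularity argument the abstract makes.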
3.133ReCoGNet: Recurrent Context-Guided Network for 3D MRI Prostate Segmentation¶
2025/06/24 14:56 GTM
Prostate gland segmentation from T2-weighted MRI is a critical yet challenging task in clinical prostate cancer assessment. While deep learning-based methods have significantly advanced automated segmentation, most conventional approaches, particularly 2D convolutional neural networks (CNNs), fail to leverage inter-slice anatomical continuity, limiting their accuracy and robustness. Fully 3D models offer improved spatial coherence but require large amounts of annotated data, which is often impractical in clinical settings. To address these limitations, we propose a hybrid architecture that models MRI sequences as spatiotemporal data. Our method uses a deep, pretrained DeepLabV3 backbone to extract high-level semantic features from each MRI slice and a recurrent convolutional head, built with ConvLSTM layers, to integrate information across slices while preserving spatial structure. This combination enables context-aware segmentation with improved consistency, particularly in data-limited and noisy imaging conditions. We evaluate our method on the PROMISE12 benchmark under both clean and contrast-degraded test settings. Compared to state-of-the-art 2D and 3D segmentation models, our approach demonstrates superior performance in terms of precision, recall, Intersection over Union (IoU), and Dice Similarity Coefficient (DSC), highlighting its potential for robust clinical deployment.
3.134Assessing Risk of Stealing Proprietary Models for Medical Imaging Tasks¶
2025/06/24 09:46 GTM
The success of deep learning in medical imaging applications has led several
companies to deploy proprietary models in diagnostic workflows, offering
monetized services. Even though model weights are hidden to protect the
intellectual property of the service provider, these models are exposed to
model stealing (MS) attacks, where adversaries can clone the model’s
functionality by querying it with a proxy dataset and training a thief model on
the acquired predictions. While extensively studied on general vision tasks,
the susceptibility of medical imaging models to MS attacks remains inadequately
explored. This paper investigates the vulnerability of black-box medical
imaging models to MS attacks under realistic conditions where the adversary
lacks access to the victim model’s training data and operates with limited
query budgets. We demonstrate that adversaries can effectively execute MS
attacks by using publicly available datasets. To further enhance MS
capabilities with limited query budgets, we propose a two-step model stealing
approach termed QueryWise. This method capitalizes on unlabeled data obtained
from a proxy distribution to train the thief model without incurring additional
queries. Evaluation on two medical imaging models for Gallbladder Cancer and
COVID-19 classification substantiates the effectiveness of the proposed attack.
The source code is available at https://
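The basic query-then-distill loop behind such attacks looks roughly like this (a generic sketch of knowledge distillation from black-box outputs, not the QueryWise method itself; the KL objective and training schedule are assumptions):

```python
import torch
import torch.nn.functional as F

def train_thief(thief, victim_predict, proxy_loader, optimizer, epochs=5, device="cpu"):
    """Fit a thief model to the victim's soft predictions on a public proxy dataset."""
    thief.to(device).train()
    for _ in range(epochs):
        for images, _ in proxy_loader:                   # proxy labels are ignored
            images = images.to(device)
            with torch.no_grad():
                soft_targets = victim_predict(images)    # queried black-box probabilities
            log_probs = F.log_softmax(thief(images), dim=1)
            loss = F.kl_div(log_probs, soft_targets, reduction="batchmean")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return thief
```

Each batch of queries consumes budget, which is why methods that also exploit unlabeled proxy data without extra queries matter in this setting.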
3.135A Global-Local Cross-Attention Network for Ultra-high Resolution Remote Sensing Image Semantic Segmentation¶
2025/06/24 08:20 GTM
With the rapid development of ultra-high resolution (UHR) remote sensing technology, the demand for accurate and efficient semantic segmentation has increased significantly. However, existing methods face challenges in computational efficiency and multi-scale feature fusion. To address these issues, we propose GLCANet (Global-Local Cross-Attention Network), a lightweight segmentation framework designed for UHR remote sensing imagery. GLCANet employs a dual-stream architecture to efficiently fuse global semantics and local details while minimizing GPU usage. A self-attention mechanism enhances long-range dependencies, refines global features, and preserves local details for better semantic consistency. A masked cross-attention mechanism also adaptively fuses global-local features, selectively enhancing fine-grained details while exploiting global context to improve segmentation accuracy. Experimental results show that GLCANet outperforms state-of-the-art methods regarding accuracy and computational efficiency. The model effectively processes large, high-resolution images with a small memory footprint, providing a promising solution for real-world remote sensing applications.
3.136MambaOutRS: A Hybrid CNN-Fourier Architecture for Remote Sensing Image Classification¶
2025/06/24 12:20 GTM
Recent advances in deep learning for vision tasks have seen the rise of State Space Models (SSMs) like Mamba, celebrated for their linear scalability. However, their adaptation to 2D visual data often necessitates complex modifications that may diminish efficiency. In this paper, we introduce MambaOutRS, a novel hybrid convolutional architecture for remote sensing image classification that re-evaluates the necessity of recurrent SSMs. MambaOutRS builds upon stacked Gated CNN blocks for local feature extraction and introduces a novel Fourier Filter Gate (FFG) module that operates in the frequency domain to capture global contextual information efficiently. Our architecture employs a four-stage hierarchical design and was extensively evaluated on challenging remote sensing datasets: UC Merced, AID, NWPU-RESISC45, and EuroSAT. MambaOutRS consistently achieved state-of-the-art (SOTA) performance across these benchmarks. Notably, our MambaOutRS-t variant (24.0M parameters) attained the highest F1-scores of 98.41% on UC Merced and 95.99% on AID, significantly outperforming existing baselines, including larger transformer models and Mamba-based architectures, despite using considerably fewer parameters. An ablation study conclusively demonstrates the critical role of the Fourier Filter Gate in enhancing the model’s ability to capture global spatial patterns, leading to robust and accurate classification. These results strongly suggest that the complexities of recurrent SSMs can be effectively superseded by a judicious combination of gated convolutions for spatial mixing and frequency-based gates for spectral global context. Thus, MambaOutRS provides a compelling and efficient paradigm for developing high-performance deep learning models in remote sensing and other vision domains, particularly where computational efficiency is paramount.
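The frequency-domain gating idea can be sketched as a learnable filter applied to the 2D FFT of the feature map (an illustrative module with assumed shapes; the actual Fourier Filter Gate design may differ):

```python
import torch
import torch.nn as nn

class FourierFilterGate(nn.Module):
    """Gate features with a learnable filter in the frequency domain (illustrative)."""
    def __init__(self, channels, height, width):
        super().__init__()
        # One complex-valued weight per channel and rFFT frequency bin.
        self.filter = nn.Parameter(torch.randn(channels, height, width // 2 + 1, 2) * 0.02)

    def forward(self, x):                              # x: (B, C, H, W)
        freq = torch.fft.rfft2(x, norm="ortho")
        freq = freq * torch.view_as_complex(self.filter)
        return torch.fft.irfft2(freq, s=x.shape[-2:], norm="ortho")

ffg = FourierFilterGate(channels=64, height=32, width=32)
out = ffg(torch.randn(2, 64, 32, 32))                  # same shape as the input
```

A global multiplication in frequency space mixes information across the whole spatial extent at once, which is how such gates capture global context without recurrence or attention.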
3.137Learning from Anatomy: Supervised Anatomical Pretraining (SAP) for Improved Metastatic Bone Disease Segmentation in Whole-Body MRI¶
2025/06/24 12:59 GTM
The segmentation of metastatic bone disease (MBD) in whole-body MRI (WB-MRI) is a challenging problem. Due to varying appearances and anatomical locations of lesions, ambiguous boundaries, and severe class imbalance, obtaining reliable segmentations requires large, well-annotated datasets capturing lesion variability. Generating such datasets requires substantial time and expertise, and is prone to error. While self-supervised learning (SSL) can leverage large unlabeled datasets, learned generic representations often fail to capture the nuanced features needed for accurate lesion detection. In this work, we propose a Supervised Anatomical Pretraining (SAP) method that learns from a limited dataset of anatomical labels. First, an MRI-based skeletal segmentation model is developed and trained on WB-MRI scans from healthy individuals for high-quality skeletal delineation. Then, we compare its downstream efficacy in segmenting MBD on a cohort of 44 patients with metastatic prostate cancer, against both a baseline random initialization and a state-of-the-art SSL method. SAP significantly outperforms both the baseline and SSL-pretrained models, achieving a normalized surface Dice of 0.76 and a Dice coefficient of 0.64. The method achieved a lesion detection F2 score of 0.44, improving on 0.24 (baseline) and 0.31 (SSL). When considering only clinically relevant lesions larger than 1 ml, SAP achieves a detection sensitivity of 100% in 28 out of 32 patients. Learning bone morphology from anatomy yields an effective and domain-relevant inductive bias that can be leveraged for the downstream segmentation task of bone lesions. All code and models are made publicly available.
3.138NeRF-based CBCT Reconstruction needs Normalization and Initialization¶
2025/06/24 16:01 GTM
Cone Beam Computed Tomography (CBCT) is widely used in medical imaging. However, the limited number and intensity of X-ray projections make reconstruction an ill-posed problem with severe artifacts. NeRF-based methods have achieved great success in this task. However, they suffer from a local-global training mismatch between their two key components: the hash encoder and the neural network. Specifically, in each training step, only a subset of the hash encoder’s parameters is used (local sparse), whereas all parameters in the neural network participate (global dense). Consequently, hash features generated in each step are highly misaligned, as they come from different subsets of the hash encoder. These misalignments from different training steps are then fed into the neural network, causing repeated inconsistent global updates in training, which leads to unstable training, slower convergence, and degraded reconstruction quality. Aiming to alleviate the impact of this local-global optimization mismatch, we introduce a Normalized Hash Encoder, which enhances feature consistency and mitigates the mismatch. Additionally, we propose a Mapping Consistency Initialization (MCI) strategy that initializes the neural network before training by leveraging the global mapping property from a well-trained model. The initialized neural network exhibits improved stability during early training, enabling faster convergence and enhanced reconstruction performance. Our method is simple yet effective, requiring only a few lines of code while substantially improving training efficiency on 128 CT cases collected from 4 different datasets, covering 7 distinct anatomical regions.
3.139Systematic Review of Pituitary Gland and Pituitary Adenoma Automatic Segmentation Techniques in Magnetic Resonance Imaging¶
2025/06/24 17:05 GTM
Purpose: Accurate segmentation of both the pituitary gland and adenomas from magnetic resonance imaging (MRI) is essential for diagnosis and treatment of pituitary adenomas. This systematic review evaluates automatic segmentation methods for improving the accuracy and efficiency of MRI-based segmentation of pituitary adenomas and the gland itself. Methods: We reviewed 34 studies that employed automatic and semi-automatic segmentation methods. We extracted and synthesized data on segmentation techniques and performance metrics (such as Dice overlap scores). Results: The majority of reviewed studies utilized deep learning approaches, with U-Net-based models being the most prevalent. Automatic methods yielded Dice scores of 0.19–89.00% for pituitary gland and 4.60–96.41% for adenoma segmentation. Semi-automatic methods reported 80.00–92.10% for pituitary gland and 75.90–88.36% for adenoma segmentation. Conclusion: Most studies did not report important metrics such as MR field strength, age and adenoma size. Automated segmentation techniques such as U-Net-based models show promise, especially for adenoma segmentation, but further improvements are needed to achieve consistently good performance in small structures like the normal pituitary gland. Continued innovation and larger, diverse datasets are likely critical to enhancing clinical applicability.