Cascade Transformers for End-to-End Person Search Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning Long-Tailed Recognition via Weight Balancing InfoGCN: Representation Learning for Human Skeleton-based Action Recognition Interactive Geometry Editing of Neural Radiance Fields MLSLT: Towards Multilingual Sign Language Translation 360MonoDepth: High-Resolution 360° Monocular Depth Estimation Generating Diverse and Natural 3D Human Motions from Text Masked-attention Mask Transformer for Universal Image Segmentation Pointly-Supervised Instance Segmentation A Closer Look at Few-shot Image Generation Learning Local-Global Contextual Adaptation for Multi-Person Pose Estimation Neural 3D Scene Reconstruction with the Manhattan-world Assumption Masked Autoencoders Are Scalable Vision Learners De-rendering 3D Objects in the Wild Direct Voxel Grid Optimization: Super-fast Convergence for Radiance Fields Reconstruction Finding Badly Drawn Bunnies GradViT: Gradient Inversion of Vision Transformers On the Importance of Asymmetry for Siamese Representation Learning Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation Self-Supervised Material and Texture Representation Learning for Remote Sensing Tasks Rethinking Efficient Lane Detection via Curve Modeling StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis Learning Fair Classifiers with Partially Annotated Group Labels Demystifying the Neural Tangent Kernel from a Practical Perspective: Can it be trusted for Neural Architecture Search without training? 
Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis A ConvNet for the 2020s Consistent 3D Scene Stylization as Stylized NeRF via 2D-3D Mutual Learning Weakly Supervised Semantic Segmentation by Pixel-to-Prototype Contrast Connecting the Complementary-view Videos: Joint Camera Identification and Subject Association Decoupled Knowledge Distillation Maximum Spatial Perturbation Consistency for Unpaired Image-to-Image Translation Compound Domain Generalization via Meta-Knowledge Encoding Bilateral Video Magnification Filter EDTER: Edge Detection with Transformer Structure-Aware Motion Transfer with Deformable Anchor Model Attentive Fine-Grained Structured Sparsity for Image Restoration Sign Language Video Retrieval with Free-Form Textual Queries SplitNets: Designing Neural Architectures for Efficient Distributed Computing on Head-Mounted Systems Neural Mean Discrepancy for Efficient Out-of-Distribution Detection LAKe-Net: Topology-Aware Point Cloud Completion by Localizing Aligned Keypoints Focal and Global Knowledge Distillation for Detectors Enhancing Adversarial Robustness for Deep Metric Learning Novel Class Discovery in Semantic Segmentation IDEA-Net: Dynamic 3D Point Cloud Interpolation via Deep Embedding Alignment WarpingGAN: Warping Multiple Uniform Priors for Adversarial 3D Point Cloud Generation Rethinking Reconstruction Autoencoder-Based Out-of-Distribution Detection HyperDet3D: Learning a Scene-Conditioned 3D Object Detector Deep Decomposition for Stochastic Normal-Abnormal Transport Signing at Scale: Learning to Co-Articulate Signs for Large-Scale Photo-Realistic Sign Language Production Self-supervised Video Transformers HLRTF: Hierarchical Low-Rank Tensor Factorization for Inverse Problems in Multi-Dimensional Imaging φ-SfT: Shape-from-Template with a Physics-based Deformation Model Boosting View Synthesis with Residual Transfer DINE: Domain Adaptation from Single and Multiple Black-box Predictors Occluded Human Mesh Recovery 
Understanding Uncertainty Maps in Vision with Statistical Testing Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets Learning from Pixel-Level Label Noise: A New Perspective for Light Field Salient Object Detection Self-Supervised Global-Local Structure Modeling for Point Cloud Domain Adaptation with Reliable Voted Pseudo Labels Towards An End-to-End Framework for Flow-Guided Video Inpainting E-CIR: Event-Enhanced Continuous Intensity Recovery Beyond Cross-view Image Retrieval: Highly Accurate Vehicle Localization using Satellite Image Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers Forward Propagation, Backward Regression and Pose Association for Hand Tracking in the Wild FERV39k: A Large-Scale Multi-Scene Dataset for Facial Expression Recognition in Videos Efficient Neural Radiance Fields Robust Equivariant Imaging: a fully unsupervised framework for learning to image from noisy and partial measurements HumanNeRF: Efficiently Generated Human Radiance Field from Sparse Inputs Attributable Visual Similarity Learning Efficient Multi-view Stereo by Iterative Dynamic Cost Volume Replacing Labeled Real-image Datasets with Auto-generated Contours SOMSI: Spherical Novel View Synthesis with Soft Occlusion Multi-Sphere Images AutoSDF: Shape Priors for 3D Completion, Reconstruction, and Generation MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions PIE-Net: Photometric Invariant Edge Guided Network for Intrinsic Image Decomposition DST: Dynamic Substitute Training for Data-free Black-box Attack HCSC: Hierarchical Contrastive Selective Coding Towards Diverse and Natural Scene-aware 3D Human Motion Synthesis Inertia-Guided Flow Completion and Style Fusion for Video Inpainting PlaneMVS: 3D Plane Reconstruction from Multi-View Stereo Ref-NeRF: Structured View-Dependent Appearance for Neural Radiance Fields Interactiveness Field of 
Human-Object Interactions Learning Memory-Augmented Unidirectional Metrics for Cross-modality Person Re-identification Event-based Video Reconstruction via Potential-assisted Spiking Neural Network SIGMA: Semantic-complete Graph Matching for Domain Adaptive Object Detection Surface Reconstruction from Point Clouds by Learning Predictive Context Priors Active Teacher for Semi-Supervised Object Detection Self-Sustaining Representation Expansion for Non-Exemplar Class-Incremental Learning RCL: Recurrent Continuous Localization for Temporal Action Detection GroupNet: Multiscale Hypergraph Neural Networks for Trajectory Prediction with Relational Reasoning SPAMs: Structured Implicit Parametric Models A Keypoint-based Global Association Network for Lane Detection Weakly Supervised Semantic Segmentation using Out-of-Distribution Data BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment Investigating Tradeoffs in Real-World Video Super-Resolution OakInk: A Large-scale Knowledge Repository for Understanding Hand-Object Interaction Bending Graphs: Hierarchical Shape Matching using Gated Optimal Transport The Norm Must Go On: Dynamic Unsupervised Domain Adaptation by Normalization SimT: Handling Open-set Noise for Domain Adaptive Semantic Segmentation Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation Graph Sampling Based Deep Metric Learning for Generalizable Person Re-Identification Stochastic Trajectory Prediction via Motion Indeterminacy Diffusion Unbiased Subclass Regularization for Semi-Supervised Semantic Segmentation Stratified Transformer for 3D Point Cloud Segmentation Cloning Outfits from Real-World Images to 3D Characters for Generalizable Person Re-Identification ImplicitAtlas: Learning Deformable Shape Templates in Medical Imaging Sparse Instance Activation for Real-Time Instance Segmentation Pastiche Master: Exemplar-Based High-Resolution Portrait Style Transfer Unsupervised Image-to-Image 
Translation with Generative Prior Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation Versatile Multi-Modal Pre-Training for Human-Centric Perception Instance-wise Occlusion and Depth Orders in Natural Scenes Degradation-agnostic Correspondence from Resolution-asymmetric Stereo No Pain, Big Gain: Classify Dynamic Point Cloud Sequences with Static Models by Fitting Feature-level Space-time Surfaces Multi-Dimensional with Intensity: A Crowd-sourced Method for Measuring the Perception of Facial Expression Class-Incremental Learning with Strong Pretrained Models A Patch-centric Error Analysis of Image Super-Resolution IFOR: Iterative Flow Minimization for Robotic Object Rearrangement 3D-aware Image Synthesis via Learning Structural and Textural Representations DeeCap: Dynamic Early Exiting for Efficient Image Captioning GAN-Supervised Dense Visual Alignment Multilayer GAN Inversion and Editing On Aliased Resizing and Surprising Subtleties in GAN Evaluation Learning Pixel Trajectories with Multiscale Contrastive Random Walks Comparing Correspondences: Video Prediction with Correspondences-wise Losses Mix and Localize: Localizing Sound Sources from Mixtures AziNorm: Exploiting the Radial Symmetry of Point Cloud for Azimuth-Normalized 3D Perception Fourier PlenOctrees for Dynamic Radiance Field Rendering in Real-time Point Cloud Pre-training with Natural 3D Structures Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation Mr.BiQ: Post-Training Non-Uniform Quantization based on Minimizing the Reconstruction Error Drop the GAN: In Defense of Patches Nearest Neighbors as Single Image Generative Models MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection Reversible Vision Transformers RigNeRF: Fully Controllable 
Neural 3D Portraits Rethinking Depth Estimation for Multi-View Stereo: A Unified Representation Integrative Few-Shot Learning for Classification and Segmentation Learning Affordance Grounding from Exocentric Images Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection Exploring Geometry Consistency for monocular 3D object detection Visual Abductive Reasoning Putting People in their Place: Monocular Regression of 3D People in Depth Exploiting Explainable Metrics for Augmented SGD Rethinking Bayesian Deep Learning Methods for Semi-Supervised Volumetric Medical Image Segmentation A Hybrid Quantum-Classical Algorithm for Robust Fitting Dataset Distillation by Matching Training Trajectories DiLiGenT10^2: A Photometric Stereo Benchmark Dataset with Controlled Shape and Material Variation Scene Representation Transformer ConDor: Self-Supervised Canonicalization of 3D Pose for Partial Shapes Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion Injecting Visual Concepts into End-to-End Image Captioning Learning Neural Light Fields with Ray-Space Embedding Networks What's in your hands? 
3D Reconstruction of Generic Objects in Hands Virtual Correspondences: Human as a Cue for Extreme-View Geometry Unsupervised Activity Segmentation by Joint Representation Learning and Online Clustering TransRank: Self-supervised Video Representation Learning via Ranking-based Transformation Recognition SketchEdit: Mask-Free Local Image Manipulation with Partial Sketches GroupViT: Zero-Shot Transfer to Semantic Segmentation with Text Supervision LSVC: A Learning-based Stereo Video Compression Framework BEHAVE: Dataset and Method for Tracking Human Object Interactions Learning to Align Sequential Actions in the Wild Motion-from-Blur: 3D Shape and Motion Estimation of Motion-blurred Objects in Videos Fixing Malfunctional Objects With Learned Physical Simulation and Functional Prediction Simulated Adversarial Testing of Face Recognition Models GOAL: Generating 4D Whole-Body Motion for Hand-Object Grasping Ensembling Off-the-shelf Models for GAN Training Global Tracking Transformers Visible-Thermal UAV Tracking: A Large-Scale Benchmark and New Baseline Joint Global and Local Hierarchical Priors for Learned Image Compression D-Grasp: Physically Plausible Dynamic Grasp Synthesis for Hand-Object Interactions Human-Aware Object Placement for Visual Environment Reconstruction Dual-path Image Inpainting with Auxiliary GAN Inversion Accurate 3D Body Shape Regression using Metric and Semantic Attributes BARC: Learning to Regress 3D Dog Shape from Images by Exploiting Breed Information Capturing and Inferring Dense Full-Body Human-Scene Contact Not All Labels Are Equal: Rationalizing The Labeling Costs for Training Object Detection Background Activation Suppression for Weakly Supervised Object Localization Attribute Group Editing for Reliable Few-shot Image Generation Negative-aware Attention for Image-Text Matching Watch It Move: Unsupervised Discovery of 3D Joints for Re-Posing of Articulated Objects TransWeather: Transformer-based Restoration of Images Degraded by Adverse 
Weather Conditions HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening gDNA: Towards Generative Detailed Neural Avatars CaDeX: Learning Canonical Deformation Coordinate Space for Dynamic Surface Representation via Neural Homeomorphism BACON: Band-limited Coordinate Networks for Multiscale Scene Representation Revisiting Near/Remote Sensing with Geospatial Attention Simple multi-dataset detection Generalizable Cross-modality Medical Image Segmentation via Style Augmentation and Dual Normalization Point-to-Voxel Knowledge Distillation for LiDAR Semantic Segmentation Online Convolutional Re-parameterization Neural Inertial Localization MNSRNet: Multimodal Transformer Network for 3D Surface Super-Resolution Unsupervised Pre-training for Temporal Action Localization Tasks Augmented Geometric Distillation for Data-Free Incremental Person ReID HEAT: Holistic Edge Attention Transformer for Structured Reconstruction NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition ContrastMask: Contrastive Learning to Segment Every Thing Unified Multivariate Gaussian Mixture for Efficient Neural Image Compression CoordGAN: Self-Supervised Dense Correspondences Emerge from GANs MAT: Mask-Aware Transformer for Large Hole Image Inpainting A Comprehensive Study of End-to-End Temporal Action Detection Rethinking Image Cropping: Exploring Diverse Compositions from Global Views OcclusionFusion: Occlusion-aware Motion Estimation for Real-time Dynamic 3D Reconstruction MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation Asynchronous Event-based Graph-Neural Networks RAMA: A Rapid Multicut Algorithm on GPU EvUnroll: Neuromorphic Events based Rolling Shutter Image Correction Cycle-Consistent Counterfactuals by Latent Transformations Understanding 3D Object Articulation in Internet Videos Synthetic Generation of Face Videos with Plethysmograph Physiology MonoJSG: Joint Semantic and Geometric Cost Volume for Monocular 3D 
Object Detection Neural Architecture Search with Representation Mutual Information Weakly Supervised Temporal Sentence Grounding with Gaussian-based Contrastive Proposal Learning Blind2Unblind: Self-Supervised Image Denoising with Visible Blind Spots Semi-Supervised Object Detection via Multi-instance Alignment with Global Class Prototypes Fine-Grained Predicates Learning for Scene Graph Generation Meta Distribution Alignment for Generalizable Person Re-Identification Align Representations with Base: A New Approach to Self-Supervised Learning Style-Based Global Appearance Flow for Virtual Try-On Learning Semantic Associations for Mirror Detection Task Decoupled Framework for Reference-based Super-Resolution Beyond Semantic to Instance Segmentation: Weakly-Supervised Instance Segmentation via Semantic Knowledge Transfer and Self-Refinement Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction GLAMR: Global Occlusion-Aware Human Mesh Recovery with Dynamic Cameras Fast and Unsupervised Action Boundary Detection for Action Segmentation Neural MoCon: Neural Motion Control for Physically Plausible Human Motion Capture Unified Transformer Tracker for Object Tracking NeuralHOFusion: Neural Volumetric Rendering under Human-object Interactions H$^2$FA R-CNN: Holistic and Hierarchical Feature Alignment for Cross-domain Weakly Supervised Object Detection ICON: Implicit Clothed humans Obtained from Normals Semantic-Aware Domain Generalized Segmentation ZebraPose: Coarse to Fine Surface Encoding for 6DoF Object Pose Estimation Detecting Deepfakes with Self-Blended Images Exact Feature Distribution Matching for Arbitrary Style Transfer and Domain Generalization FreeSOLO: Learning to Segment Objects without Annotations Auditing Privacy Defenses in Federated Learning via Generative Gradient Leakage Differentially Private Federated Learning with Local Regularization and Sparsification Modeling 3D Layout For Group Re-Identification DASO: 
Distribution-Aware Semantics-Oriented Pseudo-label for Imbalanced Semi-Supervised Learning Structured Local Radiance Fields for Human Avatar Modeling Contrastive Regression for Domain Adaptation on Gaze Estimation Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition Joint Distribution Matters: Deep Brownian Distance Covariance for Few-Shot Classification Tree Energy Loss: Towards Sparsely Annotated Semantic Segmentation Learning Second Order Local Anomaly for General Face Forgery Detection LGT-Net: Indoor Panoramic Room Layout Estimation with Geometry-Aware Transformer Network Audio-Adaptive Activity Recognition Across Video Domains Towards Robust and Adaptive Motion Forecasting: A Causal Representation Perspective Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos Omnivore: A Single Model for Many Visual Modalities Multi-Frame Self-Supervised Depth with Transformers Voice-Face Homogeneity Tells Deepfake Representation Compensation Networks for Continual Semantic Segmentation Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation FLAVA: A Foundational Language And Vision Alignment Model Vision Prompt Tuning Vehicle trajectory prediction works, but not everywhere Camera-Conditioned Stable Feature Generation for Isolated Camera Supervised Person Re-IDentification ReSTR: Convolution-free Referring Image Segmentation Using Transformers DATA: Domain-Aware and Task-Aware Self-supervised Learning Sketching without Worrying: Noise-Tolerant Sketch-Based Image Retrieval Balanced MSE for Imbalanced Visual Regression The Devil Is in the Details: Window-based Attention for Image Compression DeltaCNN: End-to-End CNN Inference of Sparse Frame Differences in Videos CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding Video Frame Interpolation Transformer Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling LASER: LAtent 
SpacE Rendering for 2D Visual Localization LaTr: Layout-Aware Transformer for Scene-Text VQA Universal Photometric Stereo Network using Global Lighting Contexts Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training Stochastic Backpropagation: A Memory Efficient Strategy for Training Video Models Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory Multi-View Consistent Generative Adversarial Networks for 3D-aware Image Synthesis AdaViT: Adaptive Tokens for Efficient Vision Transformer Neural Template: Topology-aware Reconstruction and Disentangled Generation of 3D Meshes CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition Cross-Modal Transferable Adversarial Attacks from Images to Videos PTTR: Relational 3D Point Cloud Object Tracking with Transformer Deformation and Correspondence Aware Unsupervised Synthetic-to-Real Scene Flow Estimation for Point Clouds Lifelong Unsupervised Domain Adaptive Person Re-identification with Coordinated Anti-forgetting and Adaptation Object Localization under Single Coarse Point Supervision Weakly Supervised Temporal Action Localization via Representative Snippet Knowledge Propagation TubeDETR: Spatio-Temporal Video Grounding with Transformers Reinforced Structured State-Evolution for Vision-Language Navigation Learning to Anticipate Future with Dynamic Context Removal Learning Program Representations for Food Images and Cooking Recipes Transferability Estimation using Bhattacharyya Class Separability LiDAR Snowfall Simulation for Robust 3D Object Detection Masked Feature Prediction for Vision Self-Supervised Pre-Training Unbiased Teacher v2: Semi-supervised Object Detection for Anchor-free and Anchor-based Detectors Shape from Polarization for Complex Scenes in the Wild PhotoScene: Physically-Based Material and Lighting Transfer for Indoor Scenes Node Representation Learning in Graph 
via Node-to-Neighbourhood Mutual Information Maximization Selective-Supervised Contrastive Learning with Noisy Labels LAVT: Language-Aware Vision Transformer for Referring Image Segmentation L2G: A Simple Local-to-Global Knowledge Transfer Framework for Weakly Supervised Semantic Segmentation TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing Leveraging Self-Supervision for Cross-Domain Crowd Counting Learning from Untrimmed Videos: Self-Supervised Video Representation Learning with Hierarchical Consistency TimeReplayer: Unlocking the Potential of Event Cameras for Video Interpolation Self-supervised Image-specific Prototype Exploration for Weakly Supervised Semantic Segmentation Class-Balanced Pixel-Level Self-Labeling for Domain Adaptive Semantic Segmentation Probabilistic Warp Consistency for Weakly-Supervised Semantic Correspondences DIFNet: Boosting Visual Information Flow for Image Captioning ScaleNet: A Shallow Architecture for Scale Estimation HODOR: High-level Object Descriptors for Object Re-segmentation in Video Learned from Static Images Density-preserving Deep Point Cloud Compression Exploring Dual-task Correlation for Pose Guided Person Image Generation Exploring Endogenous Shift for Cross-domain Detection: A Large-scale Benchmark and Perturbation Suppression Network Transferability metrics for selecting Source Model Ensembles The Auto Arborist Dataset: A Large-Scale Benchmark for Multimodal Urban Forest Monitoring Under Domain Shift EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection Learning from Temporal Gradient for Semi-supervised Action Recognition JoinABLe: Learning Bottom-up Assembly of Parametric CAD Joints DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion Defensive Patches for Robust Recognition in the Physical World UniCoRN: A Unified Conditional Image Repainting Network 
APES: Articulated Part Extraction from Sprite Sheets Learning Deep Implicit Functions for 3D Shapes with Dynamic Code Clouds Neural Rays for Occlusion-aware Image-based Rendering DisARM: Displacement Aware Relation Module for 3D Detection A Stitch in Time Saves Nine: A Train-Time Regularizing Loss for Improved Neural Network Calibration RIM-Net: Recursive Implicit Fields for Unsupervised Learning of Hierarchical Shape Structures Weakly Supervised Object Localization as Domain Adaption Reflash Dropout in Image Super-Resolution Semantic Segmentation by Early Region Proxy EyePAD++: A Distillation-based approach for joint Eye Authentication and Presentation Attack Detection using Periocular Images Online Learning of Reusable Abstract Models for Object Goal Navigation Time Microscope: Event-based Frame Interpolation with Parametric Non-linear Flow and Multi-scale Fusion OSOP: A Multi-Stage One Shot Object Pose Estimation Framework Localization Distillation for Dense Object Detection RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs Cross-Image Relational Knowledge Distillation for Semantic Segmentation Trustworthy Long-tailed Classification Episodic Memory Question Answering REX: Reasoning-aware and Grounded Explanation Query and Attention Augmentation for Knowledge-Based Explainable Reasoning LOLNerf: Learn from One Look Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions CoNeRF: Controllable Neural Radiance Fields Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space UnweaveNet: Unweaving Activity Stories MeMOT: Multi-Object Tracking with Memory VisualHow: Multimodal Problem Solving Affine Medical Image Registration with Coarse-to-Fine Vision Transformer Unpaired Deep Image Deraining Using Dual Contrastive Learning DiRA: Discriminative, Restorative, and Adversarial Learning for Self-supervised Medical Image Analysis Mask Transfiner for High-Quality Instance Segmentation 
GLASS: Geometric Latent Augmentation for Shape Spaces Global Convergence of MAML and Theory-Inspired Neural Architecture Search for Few-Shot Learning Multi-modal Extreme Classification CodedVTR: Codebook-Based Sparse Voxel Transformer in Geometric Regions Frequency-driven Imperceptible Adversarial Attack on Semantic Similarity Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization Self-augmented Unpaired Image Dehazing via Density and Depth Decomposition QueryDet: Cascaded Sparse Query for Accelerating High-Resolution Small Object Detection Cross-modal Representation Learning for Zero-shot Action Recognition Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation AUV-Net: Learning Aligned UV Maps for Texture Transfer and Synthesis Bijective Mapping Network for Shadow Removal ObjectFormer for Image Manipulation Detection and Localization GraFormer: Graph-oriented Transformer for 3D Pose Estimation Multi-Granularity Alignment Domain Adaptation for Object Detection Adaptive Hierarchical Representation Learning for Long-Tailed Object Detection Physical Inertial Poser (PIP): Physics-aware Real-time Human Motion Tracking from Sparse Inertial Sensors 3D Scene Painting via Semantic Image Synthesis MViTv2: Improved Multiscale Vision Transformers for Classification and Detection One-bit Active Query with Contrastive Pairs HOI4D: A 4D Egocentric Dataset for Category-Level Human-Object Interaction Leveraging Object-Level Rotation Equivariance for 3D Object Detection DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting JIFF: Jointly-aligned Implicit Face Function for High Fidelity Single View Clothed Human Reconstruction Prompt Distribution Learning CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning Beyond 3D Siamese Tracking: A Motion-Centric Paradigm for 
3D Single Object Tracking in Point Clouds Noisy Boundaries: Lemon or Lemonade for Semi-supervised Instance Segmentation? Interactive Image Synthesis with Panoptic Layout Generation Learning to Find Good Models in RANSAC Meta-attention for ViT-backed Continual Learning Deep Anomaly Discovery from Unlabeled Videos via Normality Advantage and Self-Paced Refinement Improving neural implicit surfaces geometry with patch warping Rope3D: Take A New Look from the 3D Roadside Perception Dataset for Autonomous Driving and Monocular 3D Object Detection Task AME: Attention and Memory Enhancement in Hyper-Parameter Optimization TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation Automated Progressive Learning for Efficient Training of Vision Transformers Templates for 3D Object Pose Estimation Revisited: Generalization to New Objects and Robustness to Occlusions Towards Implicit Text-Guided 3D Shape Generation Spatial-Temporal Parallel Transformer for Arm-Hand Dynamic Estimation Revisiting skeleton-based action recognition Mutual Quantization for Cross-Modal Search with Noisy Labels Revisiting Temporal Alignment for Video Restoration Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities Video Frame Interpolation with Transformer Autofocus for Event Cameras Event-based Direct Sparse Odometry OpenTAL: Towards Open Set Temporal Action Localization Programmatic Concept Learning for Human Motion Description and Synthesis MAXIM: Multi-Axis MLP for Image Processing Temporal Alignment Networks for Long-term Video Doodle It Yourself: Class Incremental Learning by Drawing a Few Sketches Registering Explicit to Implicit: Towards High-Fidelity Garment mesh Reconstruction from Single Images Progressive End-to-End Object Detection in Crowded Scenes Object-aware Video-language Pre-training for Retrieval Multi-Source Uncertainty Mining for Deep 
Unsupervised Saliency Detection Surface Representation for Point Clouds Context-Aware Video Reconstruction for Rolling Shutter Cameras MonoScene: Monocular 3D Semantic Scene Completion Weakly But Deeply Supervised Occlusion-Reasoned Parametric Road Layouts Point Cloud Color Constancy HDNet: High-resolution Dual-domain Learning for Spectral Compressive Imaging iPLAN: Interactive and Procedural Layout Planning End-to-End Multi-Person Pose Estimation with Transformers Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation Adversarial Eigen Attack on Black-Box Models Domain-Aware Representation Learning for Unsupervised Domain Generalization Sub-word Level Lip Reading With Visual Attention Efficient Video Instance Segmentation via Tracklet Query and Proposal Towards cross-modal pose localization from text-based position descriptions Opening up Open World Tracking Dynamic Clustering Mask Transformers for Panoptic Segmentation Compressive Single-Photon 3D Cameras Style-ERD: Responsive and Coherent Online Motion Style Transfer MixFormer: Mixing Features across Windows and Dimensions Robust Image Forgery Detection over Online Social Network Shared Images Semantic-aligned Fusion Transformer for One-shot Object Detection Long-term Video Frame Interpolation Via Feature Propagation Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection ETHSeg: An Amodel Instance Segmentation Network and a Real-world Dataset for X-Ray Waste Inspection SEEG: Semantic Energized Co-speech Gesture Generation Instance-Dependent Label-Noise Learning With Manifold-Regularized Transition Matrix Estimation Acquiring a Dynamic Light Field through a Single-Shot Coded Image How many Observations are Enough? 
Knowledge Distillation for Trajectory Forecasting FaceVerse: a Fine-grained and Detail-changeable 3D Neural Face Model from a Hybrid Dataset Learning Where to Learn in Cross-View Self-Supervised Learning Automatic Relation-aware Graph Network Proliferation CoSSL: Co-Learning of Representation and Classifier for Imbalanced Semi-Supervised Learning P3Depth: Monocular Depth Estimation with a Piecewise Planarity Prior Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability En-Compactness: Self-Distillation Embedding & Contrastive Generation for Generalized Zero-Shot Learning Unsupervised Learning of Accurate Siamese Tracking Accelerating DETR Convergence via Semantic-Aligned Matching Co-advise: Cross Inductive Bias Distillation Medial Spectral Coordinates for 3D Shape Analysis Coupled Iterative Refinement for 6D Multi-Object Pose Estimation DeepCurrents: Learning Implicit Representations of Shapes with Boundaries Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation Day-to-Night Image Synthesis for Training Nighttime Neural ISPs Playable Environments: Video Manipulation in Space and Time Unified Contrastive Learning in Image-Text-Label Space Many-to-many Splatting for Efficient Video Frame Interpolation Uncertainty-Aware Deep Multi-View Photometric Stereo Multi-Robot Active Mapping via Neural Bipartite Graph Matching Location-free Human Pose Estimation Multiview Transformers for Video Recognition RIO: Rotation-equivariance supervised learning of robust inertial odometry Few Shot Generative Model Adaption via Relaxed Spatial Structural Alignment MiniViT: Compressing Vision Transformers with Weight Multiplexing Pop-Out Motion: 3D-Aware Image Deformation via Learning Shape Laplacian On the Road to Online Adaptation for Semantic Image Segmentation Generalized Binary Search 
Network for Highly-Efficient Multi-View Stereo Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens Dynamic Kernel Selection for Improved Generalization and Memory Efficiency in Meta-learning Regional Semantic Contrast and Aggregation for Weakly Supervised Semantic Segmentation DLFormer: Discrete Latent Transformer for Video Inpainting Continuous Scene Representations for Embodied AI vCLIMB: A Novel Video Class Incremental Learning Benchmark NODEO: A Neural Ordinary Differential Equation Based Optimization Framework for Deformable Image Registration ONCE-3DLanes: Building Monocular 3D Lane Detection ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer HairMapper: Removing Hair from Portraits Using GANs Dist-PU: Positive-Unlabeled Learning from a Label Distribution Perspective Diversity Matters: Fully Exploiting Depth Clues for Reliable Monocular 3D Object Detection Interactive Multi-Class Tiny-Object Detection Generalizable Human Pose Triangulation Towards Discriminative Representation: Multi-view Trajectory Contrastive Learning for Online Multi-object Tracking A Simple Episodic Linear Probe Improves Visual Recognition in the Wild Learning to Learn by Jointly Optimizing Neural Architecture and Weights Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning Learning Soft Estimator of Keypoint Scale and Orientation with Probabilistic Covariant Loss Towards Semi-Supervised Deep Facial Expression Recognition with An Adaptive Confidence Margin Cross Domain Object Detection by Target-Perceived Dual Branch Distillation Depth-Aware Generative Adversarial Network for Talking Head Video Generation OccAM's Laser: Occlusion-based Attribution Maps for 3D Object Detectors on LiDAR Data Improving Adversarially Robust Few-shot Image Classification with Generalizable Representations 
DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion Stable Long-Term Recurrent Video Super-Resolution Evading the Simplicity Bias: Training a Diverse Set of Models Discovers Solutions with Superior OOD Generalization SelfD: Self-Learning Large-Scale Driving Policies From the Web InstaFormer: Instance-Aware Image-to-Image Translation with Transformer AutoGPart: Intermediate Supervision Search for Generalizable 3D Part Segmentation GASP, a generalized framework for agglomerative clustering of signed graphs and its application to Instance Segmentation Exploring and Evaluating Image Restoration Potential in Dynamic Scenes Multi-level Feature Learning for Contrastive Multi-view Clustering Commonality in Natural Images Rescues GANs: Pretraining GANs with Generic and Privacy-free Synthetic Data Threshold Matters in WSSS: Manipulating the Activation for the Robust and Accurate Segmentation Model Against Thresholds StyleSwin: Transformer-based GAN for High-resolution Image Generation Semi-Supervised Learning of Semantic Correspondence with Pseudo-Labels Divide and Conquer: Compositional Experts for Generalized Novel Class Discovery Splicing ViT Features for Semantic Appearance Transfer Optimizing Video Prediction via Video Frame Interpolation Iterative Corresponding Geometry: Fusing Region and Depth for Highly Efficient 3D Tracking of Textureless Objects HARA: A Hierarchical Approach for Robust Rotation Averaging Revisiting Weakly Supervised Pre-Training of Visual Perception Models Safe-Student for Safe Deep Semi-Supervised Learning with Unseen-Class Unlabeled Data PatchFormer: An Efficient Point Transformer with Patch Attention Locality-Aware Inter-and Intra-Video Reconstruction for Self-Supervised Correspondence Learning Neural Global Shutter: Learn to Restore Video from a Rolling Shutter Camera with Global Reset Feature Conditional Prompt Learning for Vision-Language Models Stability-driven Contact Reconstruction From Monocular Color Images 
SharpContour: A Contour-based Boundary Refinement Approach for Efficient and Accurate Instance Segmentation MSDN: Mutually Semantic Distillation Network for Zero-Shot Learning GeneralDepth: Unsupervised Learning of Single-Image Depth Estimation in General Scenes Revisiting AP Loss for Dense Object Detection: Adaptive Ranking Pair Selection No-Reference Point Cloud Quality Assessment via Domain Adaptation DArch: Dental Arch Prior-assisted 3D Tooth Instance Segmentation with Weak Annotations Self-Supervised Keypoint Discovery in Behavioral Videos Toward Practical Self-Supervised Monocular Indoor Depth Estimation Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices? DPGEN: Differentially Private Generative Energy-Guided Network for Natural Image Synthesis Learning the Degradation Distribution for Blind Image Super-Resolution ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization Exploiting Rigidity Constraints for LiDAR Scene Flow Estimation Democracy Does Matter: Comprehensive Feature Mining for Co-Salient Object Detection Unsupervised Domain Adaptation for Nighttime Aerial Tracking UDA-COPE: Unsupervised Domain Adaptation for Category-level Object Pose Estimation 3D Shape Reconstruction from 2D Images with Disentangled Attribute Flow Multimodal Dynamics: Dynamical Fusion for Trustworthy Multimodal Classification Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer StyTr2: Image Style Transfer with Transformers BokehMe: When Neural Rendering Meets Classical Rendering Memory-augmented Deep Conditional Unfolding Network for Pan-sharpening Learning Object Context for Novel-view Scene Layout Generation FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment TCTrack: Temporal Contexts for Aerial Tracking RBGNet: Ray-based Grouping for 3D Object Detection 3PSDF: Three-Pole Signed Distance Function for Learning Surfaces with Arbitrary Topologies PanopticNeRF: A Semantic Object-Aware 
Neural Scene Representation Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer Reconstructing Surfaces for Sparse Point Clouds with On-Surface Priors Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships Details or Artifacts: A Locally Discriminative Learning Approach to Realistic Image Super-Resolution Learning Motion-Dependent Appearance for High-Fidelity Rendering of Dynamic Humans from a Single Camera A Voxel Graph CNN for Object Classification with Event Cameras How Good Is Aesthetic Ability of a Fashion Model? Recurrent Dynamic Embedding for Video Object Segmentation Self-Distillation from the Last Mini-Batch for Consistency Regularization Group Contextualization for Video Recognition Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos Dual Adversarial Adaptation for Cross-Device Real-World Image Super-Resolution Urban Radiance Fields Practical Evaluation of Adversarial Robustness via Adaptive Auto Attack PINA: Learning a Personalized Implicit Neural Avatar from a Single RGB-D Video Sequence Disentangled3D: Learning a 3D Generative Model with Disentangled Geometry and Appearance from Monocular Images Global Sensing and Measurements Reuse for Image Compressed Sensing AKB-48: A Real-World Articulated Object Knowledge Base Structured Sparse R-CNN for Direct Scene Graph Generation Photorealistic Monocular 3D Reconstruction of Humans Wearing Clothing Spectral Unsupervised Domain Adaptation for Visual Recognition SimMatch: Semi-supervised Learning with Similarity Matching Multi-grained Spatio-Temporal Features Perceived Network for Event-based Lip-Reading POCO: Point Convolution for Surface Reconstruction HerosNet: Hyperspectral Explicable Reconstruction and Optimal Sampling Deep Network for Snapshot Compressive Imaging 
Towards Robust Rain Removal Against Adversarial Attacks: A Comprehensive Benchmark Analysis and Beyond FedDC: Federated Learning with Non-IID Data via Local Drift Decoupling and Correction Open-set Text Recognition via Character-Context Decoupling Generalized Few-shot Semantic Segmentation Causal Transportability for Neural Representations Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition Matching Feature Sets for Few-Shot Image Classification Interactron: Embodied Adaptive Object Detection It’s About Time: Analog Clock Reading in the Wild A Graph Matching Perspective with Transformers on Video Instance Segmentation GIF: Neural Implicit Function for General Shape Representation AdaViT: Adaptive Vision Transformers for Efficient Image Recognition Language as Queries for Referring Video Object Segmentation Federated Class-Incremental Learning Human Hands as Probes for Interactive Object Understanding STIF: Learning Continuous Video Representation for Space-Time Super-Resolution Bridging Video-text Retrieval with Multiple Choice Questions FoggyStereo: Stereo Matching with Fog Volume Representation MonoGround: Detecting Monocular 3D Objects from the Ground CLIMS: Cross Language Image Matching for Weakly Supervised Semantic Segmentation ELIC: Efficient Learned Image Compression with Unevenly Grouped Space-Channel Contextual Adaptive Coding Local Texture Estimator for Implicit Representation Function Neural Recognition of Dashed Curves with Gestalt Law of Continuity Voxel Field Fusion for 3D Object Detection Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers Both Style and Fog Matter: Cumulative Domain Adaptation for Semantic Foggy Scene Understanding SCS-Co: Self-Consistent Style Contrastive Learning for Image Harmonization H4D: Human 4D Modeling by Learning Neural Compositional Representation PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer A Unified Query-based Paradigm 
for Point Cloud Understanding AdaInt: Learning Adaptive Intervals for 3D Lookup Tables on Real-time Image Enhancement FS6D: Few-Shot 6D Pose Estimation of Novel Objects CLIP-Event: Connecting Text and Images with Event Structures Category Contrast for Unsupervised Domain Adaptation in Visual Tasks GateHUB: Gated History Unit with Background Suppression for Online Action Detection MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video Learning 3D Object Shape and Layout without 3D Supervision Discrete Cosine Transform Network for Guided Depth Super-Resolution DTFD-MIL: Double-Tier Feature Distillation Multiple Instance Learning for Histopathology Whole Slide Image Classification Recurrent Glimpse-based Decoder for Detection with Transformer HSC4D: Human-centered 4D Scene Capture in Large-scale Indoor-outdoor Space Using Wearable IMUs and LiDAR Multi-Object Tracking Meets Moving UAV Estimating Fine-Grained Noise Model via Contrastive Learning ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via Exploiting CLIP Cues Task-specific Inconsistency Alignment for Domain Adaptive Object Detection Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization Global-Aware Registration of Less-Overlap RGB-D Scans XMP-Font: Self-Supervised Cross-Modality Pre-training for Few-Shot Font Generation A Simple Data Mixing Prior for Improving Self-Supervised Vision Transformer Dense Learning based Semi-Supervised Object Detection RNNPose: Recurrent 6-DoF Object Pose Refinement with Robust Correspondence Field Estimation and Pose Optimization Global Context with Discrete Diffusion in Vector Quantised Modelling for Image Generation Collaborative Learning for Hand and Object Reconstruction with Attention-guided Graph Convolution End-to-end Generative Pretraining for Multimodal Video Captioning Exposure Normalization and Compensation for Multiple Exposure Correction Interpretable part-whole hierarchies and 
conceptual-semantic relationships in neural networks Multi-label Classification with Partial Annotations using Class-aware Selective Loss Fire Together Wire Together: A Dynamic Pruning Approach with Self-Supervised Mask Prediction IterMVS: Iterative Probability Estimation for Efficient Multi-View Stereo Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation Hierarchical Nearest Neighbor Graph Embedding for Efficient Dimensionality Reduction Decoupling Makes Weakly Supervised Local Feature Better Not All Points Are Equal: Learning Highly Efficient Point-based Detectors for 3D LiDAR Point Clouds Expanding Large Pre-trained Unimodal Models with Multimodal Information Injection for Image-Text Multimodal Classification Semi-Weakly-Supervised Learning of Complex Actions from Instructional Videos Set-Supervised Action Learning in Procedural Videos via Pairwise Order Consistency SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation BANMo: Building Animatable 3D Neural Models from Many Casual Videos HD-CSE: Learning Dense Correspondence of Clothed Humans with Vision Transformers Efficient Geometry-aware 3D Generative Adversarial Networks CAPRI-Net: Learning Compact CAD Shapes with Adaptive Primitive Assembly HL-Net: Heterophily Learning Network for Scene Graph Generation Towards Efficient Data Free Black-box Adversarial Attack Neural Collaborative Graph Machines for Table Structure Recognition Dimension Embeddings for Monocular 3D Object Detection Nested Collaborative Learning for Long-Tailed Visual Recognition Scalable Penalized Regression for Noise Detection in Learning with Noisy Labels Calibrating Deep Neural Networks by Pairwise Constraints HybridCR: Weakly-Supervised 3D Point Cloud Semantic Segmentation via Hybrid Contrastive Regularization Few-Shot Font Generation by Learning Fine-Grained Local Styles Point-NeRF: Point-based Neural Radiance Fields Spatial-Temporal Space Hand-in-Hand: Spatial-Temporal Video 
Super-Resolution via Cycle-Projected Mutual Learning Learning from All Vehicles Gait Recognition in the Wild with Dense 3D Representations and A Benchmark DETReg: Unsupervised Pretraining with Region Priors for Object Detection Rethinking Semantic Segmentation: A Prototype View Distillation Using Oracle Queries for Transformer-based Human-Object Interaction Detection MobRecon: Mobile-Friendly Hand Mesh Reconstruction from Monocular Image Spatio-temporal Relation Modeling for Few-shot Action Recognition RestoreFormer: High-Quality Blind Face Restoration from Undegraded Key-Value Pairs DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis Domain-Agnostic Prior for Unsupervised Transfer Segmentation Unimodal-Concentrated Loss: Fully Adaptive Label Distribution Learning for Ordinal Regression Pyramid Grafting Network for One-Stage High Resolution Saliency Detection Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation Towards Discovering the Effectiveness of Moderately Confident Samples for Semi-Supervised Learning Semi-Supervised Video Semantic Segmentation with Inter-Frame Feature Reconstruction Revisiting the "Video" in Video-Language Understanding SNUG: Self-Supervised Neural Dynamic Garments FocalClick: Towards Practical Interactive Image Segmentation DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation GRAM: Generative Radiance Manifolds for 3D-Aware Image Generation Temporally Efficient Vision Transformer for Video Instance Segmentation C-CAM: Causal CAM for Weakly Supervised Semantic Segmentation on Medical Image Adversarial Texture for Fooling Person Detectors in the Physical World Automatic Color Image Stitching Using Quaternion Rank-1 Alignment TemporalUV: Capturing Loose Clothing with Temporally Coherent UV Coordinates Kernelized Few-shot Object 
Detection by Integral Aggregation Image-to-Lidar Self-Supervised Distillation for Autonomous Driving Data Amodal Segmentation through Out-of-Task and Out-of-Distribution Generalization with a Bayesian Model FocusCut: Diving into a Focus View in Interactive Segmentation Mutual Information-driven Pan-sharpening Gradient-SDF: A Semi-Implicit Surface Representation for 3D Reconstruction Neural Head Avatars from Monocular RGB Videos Point-Level Region Contrast for Object Detection Pre-Training HODEC: Towards Efficient High-Order DEcomposed Convolutional Neural Networks Bridging Global Context Interactions for High-Fidelity Image Completion CDGNet: Class Distribution Guided Network for Human Parsing Primitive3D: Learning from 3D Objects Assembled with Random Primitives HumanNeRF: Free-viewpoint Rendering of Moving People from Monocular Video TransMix: Attend to Mix for Vision Transformers JRDB-Act: A Large-scale Dataset for Spatio-temporal Action, Social Group and Activity Detection Few-shot Head Swapping in the Wild Neural Texture Extraction and Distribution for Controllable Person Image Synthesis Embracing Single Stride 3D Object Detector with Sparse Transformer Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning Portrait Eyeglasses and Shadow Removal by Leveraging 3D Synthetic Data Expanding Low-Density Latent Regions for Open-Set Object Detection GMFlow: Learning Optical Flow via Global Matching Source-Free Domain Adaptation via Distribution Estimation Aesthetic Text Logo Synthesis via Content-aware Layout Inferring An Image Patch is a Wave: Phase-Aware Vision MLP FisherMatch: Semi-Supervised Rotation Regression via Entropy-based Filtering BE-STI: Spatial-Temporal Integrated Network for Class-agnostic Motion Prediction with Bidirectional Enhancement DC-SSL: Addressing Mismatched Class Distribution in Semi-supervised Learning Deterministic Point Cloud Registration via Novel Transformation Decomposition Look for the Change: Learning Object States 
and State-Modifying Actions from Untrimmed Web Videos Deep Visual Geo-localization Benchmark LC-FDNet: Learned Lossless Image Compression with Frequency Decomposition Network Towards Robust Vision Transformer Volumetric Bundle Adjustment for Photorealistic Real-time Reconstruction Continual Test-Time Domain Adaptation Scribble-Supervised LiDAR Semantic Segmentation TableFormer: Table Structure Understanding with Transformers Focal Sparse Convolutional Networks for 3D Object Detection CLRNet: Cross Layer Refinement Network for Lane Detection Transformer Based Line Segment Classifier with Image Context for Real-Time Vanishing Point Detection in Manhattan World NeRFReN: Neural Radiance Fields with Reflections HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing Ditto: Building Digital Twins of Articulated Objects from Interaction CroMo: Cross-Modal Learning for Monocular Depth Estimation Mobile-Former: Bridging MobileNet and Transformer MetaFormer is Actually What You Need for Vision RU-Net: Regularized Unrolling Network for Scene Graph Generation Dreaming to Prune Image Deraining Networks Salvage of Supervision in Weakly Supervised Object Detection Lagrange Motion Analysis and View Embeddings for Improved Gait Recognition Lite Pose: Efficient Architecture Design for 2D Human Pose Estimation SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning FMCNet: Feature-Level Modality Compensation for Visible-Infrared Person Re-Identification Generalizing Gaze Estimation with Rotation Consistency SIOD: Single Instance Annotated Per Category Per Image for Object Detection Temporal Complementarity-Guided Reinforcement Learning for Image-to-Video Person Re-Identification A Differentiable Two-stage Alignment Scheme for Burst Image Reconstruction with Large Shift Manifold Learning Benefits GANs Domain Generalization via Shuffled Style Assembly for Face Anti-Spoofing OW-DETR: Open-world Detection Transformer Learning Optimal K-space 
Acquisition and Reconstruction using Physics-Informed Neural Networks Global Tracking via Ensemble of Local Trackers Robust Region Feature Synthesizer for Zero-Shot Object Detection Confidence Propagation Cluster: Unleash Full Potential of Object Detectors PartGlot: Learning Shape Part Segmentation from Language Reference Games Self-Taught Metric Learning without Labels GPV-Pose: Category-level Object Pose Estimation via Geometry-guided Point-wise Voting OmniFusion: 360 Monocular Depth Estimation via Geometry-Aware Fusion 3D Common Corruptions and Data Augmentation DIVeR: Real-time and Accurate Neural Radiance Fields with Deterministic Integration for Volume Rendering Boosting Robustness of Image Matting with Context Assembling and Strong Data Augmentation Cross-modal Clinical Graph Transformer For Ophthalmic Report Generation Correlation-Aware Deep Tracking Learning to Imagine: Diversify Memory for Incremental Learning using Unlabeled Data Block-NeRF: Scalable Large Scene Neural View Synthesis Vector Quantized Diffusion Model for Text-to-Image Synthesis Boosting Crowd Counting via Multifaceted Attention Physically-guided Disentangled Implicit Rendering for 3D Face Modeling IFRNet: Intermediate Feature Refine Network for Efficient Frame Interpolation TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers Back to Reality: Weakly-supervised 3D Detection with Shape-guided Label Enhancement Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding Blind Image Super-resolution with Elaborate Degradation Modeling on Noise and Kernel Reduce Information Loss in Transformers for Pluralistic Image Inpainting OCSampler: Compressing Videos to One Clip with Single-step Sampling Masking Adversarial Damage: Finding Adversarial Saliency for Robust and Sparse Network SemAffiNet: Semantic-Affine Transformation for Point Cloud Segmentation High-resolution Face Swapping via Latent Semantics Disentanglement Deep Rectangling 
for Image Stitching: A Learning Baseline Detector-Free Weakly Supervised Group Activity Recognition Unsupervised Domain Generalization by learning a Bridge Across Domains RSCFed: Random Sampling Consensus Federated Semi-supervised Learning IntraQ: Learning Synthetic Images with Intra-Class Heterogeneity for Zero-Shot Network Quantization A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution Learned Queries for Efficient Local Attention Look Back and Forth: Video Super-Resolution with Explicit Temporal Difference Modeling HVH: Learning a Hybrid Neural Volumetric Representation for Dynamic Hair Performance Capture Robust Contrastive Learning against Noisy Views Discovering Objects that Can Move TubeFormer-DeepLab: Video Mask Transformer Sparse and Complete Latent Organization for Geospatial Semantic Segmentation ITSA: An Information Theoretic Approach to Automatic Shortcut Avoidance and Domain Generalization in Stereo Matching Networks Few-shot Backdoor Defense Using Shapley Estimation Exploring Domain-Invariant Parameters for Source Free Domain Adaptation Ev-TTA: Test-Time Adaptation for Event-Based Object Recognition Likert Scoring with Grade Decoupling for Long-term Action Assessment Unpaired Cartoon Image Synthesis via Gated Cycle Mapping Contextual Instance Decoupling for Robust Multi-Person Pose Estimation Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes Modulated Contrast for Versatile Image Translation Oriented RepPoints for Aerial Object Detection INS-Conv: Incremental Sparse Convolution for Online 3D Segmentation PanopticDepth: Instance-Decoupled Depth Estimation for Unified Depth-Aware Panoptic Segmentation Point-BERT: Pre-Training 3D Point Cloud Transformers with Masked Point Modeling Implicit Sample Extension for Unsupervised Person Re-Identification Incorporating Semi-Supervised and Positive-Unlabeled learning for Boosting Full Reference Image Quality 
Assessment HairCLIP: Design Your Hair by Text and Reference Image C2AM Loss: Chasing a Better Decision Boundary for Long-Tail Object Detection MogFace: Towards a Deeper Appreciation on Face Detection RegionCLIP: Region-based Language-Image Pretraining HP-Capsule: Unsupervised Face Part Discovery by Hierarchical Parsing Capsule Network Structure-Aware Flow Generation for Human Body Reshaping Revisiting Document Image Dewarping by Grid Regularization GANSeg: Learning to Segment by Unsupervised Hierarchical Image Generation Align and Prompt: Video-and-Language Pre-training with Entity Prompts Bridging the Gap between Classification and Localization for Weakly Supervised Object Localization Shunted Self-Attention via Multi-Scale Token Aggregation VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer YouMVOS: An Actor-centric Multi-shot Video Object Segmentation Dataset Single-Stage is Enough: Multi-Person Absolute 3D Pose Estimation UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection DiSparse: Disentangled Sparsification for Multitask Model Compression Coarse-to-fine Deep Video Coding with Hyperprior-guided Mode Prediction Weakly Supervised High-Fidelity Clothing Model Generation Deep Generalized Unfolding Networks for Image Restoration Panoptic-PHNet: Towards Real-Time and High-Precision LiDAR Panoptic Segmentation via Clustering Pseudo Heatmap ES6D: A Computation Efficient and Symmetry-Aware 6D Pose Regression Framework Iterative Deep Homography Estimation Homography Loss for Monocular 3D Object Detection Infrared Invisible Clothing: Hiding from Infrared Detectors at Multiple Angles in Real World Deep Stereo Image Compression via Bi-directional Coding Degree-of-linear-polarization-based Color Constancy Unleashing Potential of Unsupervised Pre-Training with Intra-Identity Regularization for Person Re-Identification Aladdin: Joint Atlas 
Building and Diffeomorphic Registration Learning with Pairwise Alignment Learning Transferable Human-Object Interaction Detector with Natural Language Supervision PNP: Robust Learning from Noisy Labels by Probabilistic Noise Prediction RayMVSNet: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo Shapley-NAS: Discovering Operation Contribution for Neural Architecture Search Few-shot Keypoint Detection with Uncertainty Learning for Unseen Species Reusing the Task-specific Classifier as a Discriminator: Discriminator-free Adversarial Domain Adaptation "The Pedestrian next to the Lamppost" Adaptive Object Graphs for Better Instantaneous Mapping Point2Seq: Detecting 3D Objects as Sequences Towards Noiseless Object Contours for Weakly Supervised Semantic Segmentation Syntax-Aware Network for Handwritten Mathematical Expression Recognition RAGO: Recurrent Graph Optimizer For Multiple Rotation Averaging A Brand New Dance Partner: Music-Conditioned Pluralistic Dancing Controlled by Multiple Dance Genres BNVF: Dense 3D Reconstruction using Bi-level Neural Volume Fusion AutoLoss-Zero: Searching Loss Functions from Scratch for Generic Tasks Exploring the Equivalence of Siamese Self-Supervised Learning via A Unified Gradient Framework Cross-domain Few-shot Learning with Task-specific Adapters Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks Geometric and Textural Augmentation for Domain Gap Reduction Geometric Transformer for Fast and Robust Point Cloud Registration Group R-CNN for Point-based Weakly Semi-supervised Object Detection Wnet: Audio-Guided Video Semantic Segmentation via Wavelet-Based Cross-Modal Denoising Networks 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds ELSR: Efficient Line Segment Reconstruction with Planes and Points Guidance A Proposal-based Paradigm for Self-supervised Sound Source Localization in Videos Semi-Supervised Wide-Angle 
Portraits Correction by Multi-Scale Transformer End-to-End Referring Video Object Segmentation with Multimodal Transformers Neural fields as learnable kernels for 3D reconstruction IDR: Self-Supervised Image Denoising via Iterative Data Refinement TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers SimAN: Exploring Self-Supervised Representation Learning of Scene Text via Similarity-Aware Normalization Deep vanishing point detection: Geometric priors make dataset variations vanish On Adversarial Robustness of Trajectory Prediction for Autonomous Vehicles Learning Multiple Dense Prediction Tasks from Partially Annotated Data Quarantine: Sparsity Can Uncover the Trojan Attack Trigger for Free Video Demoireing with Relation-based Temporal Consistency FLAG: Flow-based 3D Avatar Generation from Sparse Observations Learning an Optimal Linear Program for Multi-Target Tracking IRON: Inverse Rendering by Optimizing Neural SDFs and Materials from Photometric Images Stereoscopic Universal Perturbations across Different Architectures and Datasets The Flag Median and FlagIRLS NeRF in the Dark: High Dynamic Range View Synthesis from Noisy Raw Images BoxeR: Box-Attention for 2D and 3D Transformers DynamicEarthNet: Daily Multi-Spectral Satellite Dataset for Semantic Change Segmentation UBnormal: New Benchmark for Supervised Open-Set Video Anomaly Detection Self-Supervised Predictive Convolutional Attentive Block for Anomaly Detection CADTransformer: Panoptic Symbol Spotting Transformer for CAD Drawings The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy Learning To Recognize Procedural Activities with Distant Supervision Audio-driven Neural Gesture Reenactment with Video Motion Graphs Towards Bidirectional Arbitrary Image Rescaling: Joint Optimization and Cycle Idempotence Hire-MLP: Vision MLP via Hierarchical Rearrangement Escaping Data Scarcity for High-Resolution Heterogeneous Face Hallucination 
DeepDPM: Deep Clustering With an Unknown Number of Clusters ZeroWaste Dataset: Towards Deformable Object Segmentation in Cluttered Scenes Context-Aware Sequence Alignment using 4D Skeletal Augmentation COAP: Compositional Articulated Occupancy of People Sound and Visual Representation Learning with Multiple Pretraining Tasks The Wanderings of Odysseus in 3D Scenes Deblurring via Stochastic Refinement SMPL-A: Modeling Person-Specific Deformable Anatomy Neural Point Light Fields FedCor: Correlation-Based Active Client Selection Strategy for Heterogeneous Federated Learning ADeLA: Automatic Dense Labeling with Attention for Viewpoint Shift in Semantic Segmentation Adversarial Parametric Pose Prior Generating Useful Accident-Prone Driving Scenarios via a Learned Traffic Prior Pre-Training meets Self-Training for Supersizing 3D Reconstruction Safe Self-Refinement for Transformer-based Domain Adaptation ElePose: Unsupervised 3D Human Pose Estimation by Predicting Camera Elevation and Learning Normalizing Flows on 2D Poses Towards Multimodal Depth Estimation from Light Fields Deformable Sprites for Unsupervised Video Decomposition Can You Spot the Chameleon? 
Adversarially Camouflaging Images from Co-Salient Object Detection MISF: Multi-level Interactive Siamese Filtering for High-Fidelity Image Inpainting Aug-NeRF: Training Stronger Neural Radiance Fields with Triple-Level Physically-Grounded Augmentations Semi-supervised Semantic Segmentation with Error Localization Network Quantization-aware Deep Optics for Snapshot Hyperspectral Imaging Gravitationally Lensed Black Hole Emission Tomography Improving Video Model Transfer with Dynamic Representation Learning FWD: Real-time Novel View Synthesis with Forward Warping and Depth Enhancing Adversarial Training with Second-Order Statistics of Weights Patch Slimming for Efficient Vision Transformers 3DAC: Learning Attribute Compression for Point Clouds SNR-Aware Low-light Image Enhancement Temporal Feature Alignment and Mutual Information Maximization for Video-Based Human Pose Estimation Motion-modulated Temporal Fragment Alignment Network For Few-Shot Action Recognition Self-Supervised Bulk Motion Artifact Removal in Optical Coherence Tomography Angiography Salient-to-Broad Transition for Video Person Re-identification Which images to label for few-shot medical landmark detection? 
Hybrid Relation Guided Set Matching for Few-shot Action Recognition Progressively Generating Better Initial Guesses Towards Next Stages for High-Quality Human Motion Prediction Bringing Old Films Back to Life Face Relighting with Geometrically Consistent Shadows Learning Cloth-Irrelevant Features for Cloth-Changing Person Re-identification DPICT: Deep Progressive Image Compression Using Trit-Planes From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering Simple but Effective: CLIP Embeddings for Embodied AI Scene Consistency Representation Learning for Video Scene Segmentation Neural Data-Dependent Transform for Learned Image Compression CamLiFlow: Bidirectional Camera-LiDAR Fusion for Joint Optical Flow and Scene Flow Estimation Global Matching with Overlapping Attention for Optical Flow Estimation Meta Agent Teaming Active Learning for Pose Estimation Robust Combination of Distributed Gradients Under Adversarial Perturbations Toward Fast, Flexible, and Robust Low-Light Image Enhancement Motion-aware Contrastive Video Representation Learning via Foreground-background Merging ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval L-Verse: Bidirectional Generation Between Image and Text GANORCON: Are Generative Models Useful for Few-shot Segmentation? 
Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation Towards Robust Adaptive Object Detection under Noisy Annotations Point2Cyl: Reverse Engineering 3D Objects -- from Point Clouds to Extrusion Cylinders MM-TTA: Multi-Modal Test-Time Adaptation for 3D Semantic Segmentation Subspace Adversarial Training Structural and Statistical Texture Knowledge Distillation for Semantic Segmentation UniVIP: A Unified Framework for Self-Supervised Visual Pre-training MUM : Mix Image Tiles and UnMix Feature Tiles for Semi-Supervised Object Detection SS3D: Sparsely-Supervised 3D Object Detection from Point Cloud On the Integration of Self-Attention and Convolution Single-Domain Generalized Object Detection in Urban Scene via Cyclic-Disentangled Self-Distillation Human Instance Matting via Mutual Guidance and Multi-Instance Refinement Delving Deep into the Generalization of Vision Transformers under Distribution Shifts Causality Inspired Representation Learning for Domain Generalization Learning Local Displacements for Point Cloud Completion Remember Intentions: Retrospective-Memory-based Trajectory Prediction Contextual Similarity Distillation for Asymmetric Image Retrieval Self-Supervised Models are Continual Learners High-Fidelity Human Avatars from a Single RGB Camera Not All Relations are Equal: Mining Informative Labels for Scene Graph Generation TWIST: Two-Way Inter-label Self-Training for Semi-supervised 3D Instance Segmentation Focal length and object pose estimation via render and compare Kubric: A scalable dataset generator VRDFormer: End-to-End Video Visual Relation Detection with Transformers A Large-scale Comprehensive Dataset and Copy-overlap Aware Evaluation Protocol for Segment-level Video Copy Detection Brain-inspired Multilayer Perceptron with Spiking Neurons Few Could Be Better Than All: Feature Sampling and Grouping for Scene Text Detection High Quality Segmentation for Ultra High-resolution Images Physically Disentangled Intra- and 
Inter-domain Adaptation for Varicolored Haze Removal HandOccNet: Occlusion-Robust 3D Hand Mesh Estimation Network Future Transformer for Long-term Action Anticipation Decoupling Zero-Shot Semantic Segmentation Long-tail Recognition via Compositional Knowledge Transfer Open Challenges in Deep Stereo: the Booster Dataset BigDatasetGAN: Synthesizing ImageNet with Pixel-wise Annotations Recall@k Surrogate Loss with Large Batches and Similarity Mixup PoseTriplet: Co-evolving 3D Human Pose Estimation, Imitation, and Hallucination under Self-supervision Dynamic Dual-Output Diffusion Models End-to-End Human-Gaze-Target Detection with Transformers EMOCA: Emotion Driven Monocular Face Capture and Animation R(Det)$^2$: Randomized Decision Routing for Object Detection Diffusion Autoencoders: Toward a Meaningful and Decodable Representation PatchNet: A Simple Face Anti-Spoofing Framework via Fine-Grained Patch Recognition NeurMiPs: Neural Mixture of Planar Experts for View Synthesis Learning to generate line drawings that convey geometry and semantics AlignQ: Alignment Quantization with ADMM-based Correlation Preservation Learning Embodied Object-Search Strategies from 50k Human Demonstrations Longitudinal Self-Supervision for Learning 2D Amodal Representation Controllable Dynamic Multi-Task Architectures Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning Depth-supervised NeRF: Fewer Views and Faster Training for Free Learning to Detect Mobile Objects from LiDAR Scans Without Labels Revisiting Random Channel Pruning for Neural Network Compression ActiveZero: Mixed Domain Learning for Active Stereovision with Zero Annotation Learning sRGB-to-Raw De-rendering with Content-Aware Metadata SimVQA: Exploring Simulated Environments for Visual Question Answering Cross-Domain Adaptive Teacher for Object Detection Modality-Agnostic Learning for Radar-Lidar Fusion in 
Vehicle Detection A Thousand Words Are Worth More Than a Picture: Natural Language-Centric Outside-Knowledge Visual Question Answering Towards General Purpose Vision Systems: An End-to-End Task-Agnostic Vision-Language Architecture Holocurtains: Programming Light Curtains via Binary Holography Leverage Your Local and Global Representations: A New Self-Supervised Learning Strategy 3D human tongue reconstruction from single "in-the-wild" images Pushing the Performance Limit of Scene Text Recognizer without Human Annotation SAR-Net: Shape Alignment and Recovery Network for Category-level 6D Object Pose and Size Estimation Improving Subgraph Recognition with Variational Graph Information Bottleneck Towards Multi-domain Single Image Dehazing via Test-time Training EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching CHEX: CHannel EXploration for CNN Model Compression ImFace: A Nonlinear 3D Morphable Face Model with Implicit Neural Representations Deblur-NeRF: Neural Radiance Fields from Blurry images An MIL-Derived Transformer for Weakly Supervised Point Cloud Segmentation Distribution Consistent Neural Architecture Search Training Object Detectors from Scratch: An Empirical Study in the Era of Vision Transformer Glass Segmentation using Intensity and Spectral Polarization Cues GAT-CADNet: Graph Attention Network for Panoptic Symbol Spotting in CAD Drawings Unsupervised Deraining: Where Contrastive Learning Meets Self-similarity Delving into the Estimation Shift of Batch Normalization in a Network Depth Estimation by Combining Binocular Stereo and Monocular Structured-Light Full-Range Virtual Try-On with Recurrent Tri-Level Transformation Class Re-Activation Maps for Weakly-Supervised Semantic Segmentation Generalizing Interactive Backpropagating Refinement for Dense Prediction Networks Protecting Celebrities from DeepFake with Identity Consistency Transformer SVIP: Sequence VerIfication for Procedures in Videos Cannot See the 
Forest for the Trees: Aggregating Multiple Viewpoints to Better Classify Objects in Videos Deep Saliency Prior for Reducing Visual Distraction ClothFormer: Taming Video Virtual Try-on in All Module FLARF: Fast LArge-scale Radiance Field Reconstruction Estimating Structural Disparities in Face Models Faithful Extreme Rescaling via Generative Prior Reciprocated Invertible Representations Animal Kingdom: A Large and Diverse Dataset for Animal Behavior Understanding Uniform Subdivision of Omnidirectional Camera Space for Efficient Spherical Stereo Matching COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval Scene Graph Expansion for Semantics-Guided Image Outpainting Deep Constrained Least Squares for Blind Image Super-Resolution MaskGIT: Masked Generative Image Transformer CMT: Convolutional Neural Networks Meet Vision Transformers GraftNet: Towards Domain Generalized Stereo Matching with a Broad-Spectrum and Task-Oriented Feature SoftGroup for 3D Instance Segmentation on Point Clouds Partial Class Activation Attention for Semantic Segmentation AnyFace: Free-style Text-to-Face Synthesis and Manipulation PoseKernelLifter: Metric Lifting of 3D Human Pose using Sound LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection Make It Move: Controllable Image-to-Video Generation with Text Descriptions Semi-Supervised Semantic Segmentation Using Unreliable Pseudo-Labels Learning What Not to Segment: A New Perspective on Few-Shot Segmentation TT-VSR: Learning Trajectory-Aware Transformer for Video Super-Resolution Canonical Voting: Towards Robust Oriented Bounding Box Detection in 3D Scenes DyRep: Bootstrapping Training with Dynamic Re-parameterization VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning GreedyNASv2: Greedier Search with a Greedy Path Filter HDR-NeRF: High Dynamic Range Neural Radiance Fields Novel-View Object Selection in Neural Volumetric Representations Relieving Long-tailed Instance 
Segmentation via Pairwise Class Balance Complex Video Action Reasoning via Learnable Markov Logic Network PCL: Proxy-based Contrastive Learning for Domain Generalization Unifying Motion Deblurring and Frame Interpolation with Events Shape-invariant 3D Adversarial Point Clouds Learning Pixel-Level Distinctions for Video Highlight Detection Wavelet Knowledge Distillation: Towards Efficient Image-to-Image Translation ADAS: A Direct Adaptation Strategy for Multi-Target Domain Adaptive Semantic Segmentation PSTR: End-to-End One-Step Person Search With Transformers Towards real-world navigation with deep differentiable planners Multi-class Token Transformer for Weakly Supervised Semantic Segmentation Fourier Document Restoration for Robust Document Dewarping and Recognition Neural RGB-D Surface Reconstruction LMGP: Lifted Multicut Meets Geometry Projections for Multi-Camera Multi-Object Tracking ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation Spatio-Temporal Gating-Adjacency GCN for Human Motion Prediction What Matters For Meta-Learning Vision Regression Tasks? 
Self-supervised Learning of Adversarial Examples: Towards Good Generalizations for Deepfake Detection Ray Priors through Reprojection: Improving Neural Radiance Fields for Novel View Extrapolation Perception Prioritized Training of Diffusion Models Pseudo-Stereo for Monocular 3D Object Detection in Autonomous Driving Human Trajectory Prediction with Momentary Observation General Facial Representation Learning in a Visual-Linguistic Manner Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model Contextual Outpainting with Object-level Contrastive Learning Optical Flow Estimation for Spiking Camera PointCLIP: Point Cloud Understanding by CLIP Large scale pre-training for person re-identification with noisy labels Zoom In and Out: A Mixed-scale Triplet Network for Camouflaged Object Detection Blended Diffusion for Text-driven Editing of Natural Images CREAM: Weakly Supervised Object Localization via Class RE-Activation Mapping Finding Fallen Objects Via Asynchronous Audio-Visual Integration HeadNeRF: A Real-time NeRF-Based Parametric Head Model Interacting Attention Graph for Single Image Two-Hand Reconstruction Learning based Multi-modality Image and Video Compression DR.VIC: Decomposition and Reasoning for Video Individual Counting End-to-End Compressed Video Representation Learning for Generic Event Boundary Detection BaLeNAS: Differentiable Architecture Search via Bayesian Learning Rule Task Adaptive Parameter Sharing for Multi-Task Learning ViM: Out-Of-Distribution with Virtual-logit Matching Pyramid Adversarial Training Improves ViT Performance Depth-Guided Sparse Structure-from-Motion for Movies and TV Shows Part-based Pseudo Label Refinement for Unsupervised Person Re-identification Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment MVS2D: Efficient Multi-view 
Stereo via Attention-Driven 2D Convolutions Consistent Explanations by Contrastive Learning FvOR: Robust Joint Shape and Pose Optimization for Few-view Object Reconstruction Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision Frame Averaging for Equivariant Shape Space Learning iFS-RCNN: An Incremental Few-shot Instance Segmenter Bring Evanescent Representations to Life in Lifelong Class Incremental Learning Text to Image Generation with Semantic-Spatial Aware GAN Real-Time Light-Weight Near-Field Photometric Stereo DESTR: Object Detection with Split Transformer Backdoor Attacks on Self-Supervised Learning Diverse Image Outpainting via GAN Inversion High-Resolution Image Synthesis with Latent Diffusion Models NFormer: Robust Person Re-identification with Neighbor Transformer Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality CrossLoc: Scalable Aerial Localization Assisted by Multimodal Synthetic Data SceneSqueezer: Learning to Compress Scene for Camera Relocalization Dancing under the stars: video denoising in starlight Tracking People by Predicting 3D Appearance, Location and Pose BCOT: A Markerless High-Precision 3D Object Tracking Benchmark Continual Stereo Matching of Continuous Driving Scenes with Growing Architecture CVF-SID: Cyclic multi-Variate Function for Self-Supervised Image Denoising by Disentangling Noise from Image Unknown-Aware Object Detection: Learning What You Don’t Know from Videos in the Wild BodyGAN: General-purpose Controllable Neural Human Body Generation Training-free Transformer Architecture Search Learning to Affiliate: Mutual Centralized Learning for Few-shot Classification Single-Photon Structured Light Towards Practical Certifiable Patch Defense with Vision Transformer On Generalizing Beyond Domains in Cross-Domain Continual Learning Practical Learned Lossless JPEG Recompression with Multi-Level Cross-Channel Entropy Model in the DCT Domain GazeOnce: Real-Time Multi-Person Gaze 
Estimation RendNet: Unified 2D/3D Recognizer with Latent Space Rendering Identifying Ambiguous Similarity Conditions via Semantic Matching Learn from Others and Be Yourself in Heterogeneous Federated Learning Enhancing Face Recognition with Self-Supervised 3D Reconstruction Visual Vibration Tomography: Estimating Interior Material Properties from Monocular Video ACPL: Anti-curriculum Pseudo-labelling for Semi-supervised Medical Image Classification The Two Dimensions of Worst-case Training and the Integrated Effect for Out-of-domain Generalization Perturbed and Strict Mean Teachers for Semi-supervised Semantic Segmentation Directional Self-supervised Learning for Heavy Image Augmentations CPPF: Towards Robust Category-Level 9D Pose Estimation in the Wild Cross-patch Dense Contrastive Learning for Semi-supervised Segmentation of Cellular Nuclei in Histopathologic Images Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition UCC: Uncertainty guided Cross-head Co-training for Semi-Supervised Semantic Segmentation Few-Shot Object Detection with Fully Cross-Transformer Exploiting Temporal Relations on Radar Perception for Autonomous Driving Unsupervised Visual Representation Learning by Online Constrained K-Means Contextual Debiasing for Visual Recognition with Causal Mechanisms Learning to Estimate Robust 3D Human Mesh from In-the-Wild Crowded Scenes Towards Accurate Facial Landmark Detection via Cascaded Transformers DIP: Deep Inverse Patchmatch for High-Resolution Optical Flow Critical Regularizations for Neural Surface Reconstruction in the Wild Per-Clip Video Object Segmentation CAFE: Learning to Condense Dataset by Aligning Features ArtiBoost: Boosting Articulated 3D Hand-Object Pose Estimation via Online Exploration and Synthesis SphereSR: 360° Image Super-Resolution with Arbitrary Projection via Continuous Spherical Image Representation Learning to Restore 3D Face from In-the-Wild Degraded Images BEVT: BERT Pretraining of Video Transformers 
A Hybrid Egocentric Activity Anticipation Framework via Memory-Augmented Recurrent and One-shot Representation Forecasting Sparse Fuse Dense: Towards High Quality 3D Detection with Depth Completion MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection Synthetic Aperture Imaging with Events and Frames AP-BSN: Self-Supervised Denoising for Real-World Images via Asymmetric PD and Blind-Spot Network Dynamic MLP for Fine-Grained Image Classification by Leveraging Geographical and Temporal Information Lepard: Learning partial point cloud matching in rigid and deformable scenes Neural Compression-Based Feature Learning for Video Restoration Learning to Collaborate in Decentralized Learning of Personalized Models Rethinking Parsing Branch for Human Densepose Estimation Collaborative Transformers for Grounded Situation Recognition ISNet: Shape Matters for Infrared Small Target Detection Bi-level Doubly Variational Learning for Energy-based Latent Variable Models PSMNet: Position-aware Stereo Merging Network for Room Layout Estimation Bi-level Alignment for Cross-Domain Crowd Counting Unsupervised Homography Estimation with Coplanarity-Aware GAN Real-time Object Detection for Streaming Perception Neural Window Fully-connected CRFs for Monocular Depth Estimation Deep Hyperspectral-Depth Reconstruction Using Single Color-Dot Projection Decoupled Multi-task Learning with Cyclical Self-Regulation for Face Parsing Shadows can be Dangerous: Stealthy and Effective Physical-world Adversarial Attack by Natural Phenomenon Towards Understanding Adversarial Robustness of Optical Flow Networks Class-Incremental Learning by Knowledge Distillation with Adaptive Feature Consolidation A Continuous Video Generator with the Price, Quality and Perks of StyleGAN2 Self-Supervised Learning of Object Parts for Semantic Segmentation High-Resolution Image Harmonization via Collaborative Dual Transformations Slot-VPS: Object-centric Representation Learning for Video Panoptic 
Segmentation FIFO: Learning Fog-invariant Features for Foggy Scene Segmentation Forecasting Characteristic 3D Poses of Human Actions Equalized Focal Loss for Dense Long-tailed Object Detection Style Neophile: Constantly Seeking Novel Styles for Domain Generalization Mining Multi-View Information: A Strong Self-Supervised Framework for Depth-based 3D Hand Pose and Mesh Estimation The Devil is in the Labels: Noisy Label Correction for Robust Scene Graph Generation Correlation Verification for Image Retrieval Exploring Denoised Cross-video Contrast for Weakly-supervised Temporal Action Localization UBoCo : Unsupervised Boundary Contrastive Learning for Generic Event Boundary Detection Multi-View Mesh Reconstruction with Neural Deferred Shading SoftCollage: A Differentiable Probabilistic Tree Generator for Image Collage OVE6D: Object Viewpoint Encoding For Depth-based 6D Object Pose Estimation Smooth-Swap: A Simple Enhancement for Face-Swapping with Smoothness 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection Image Disentanglement Autoencoder for Steganography without Embedding Gated2Gated: Self-Supervised Depth Estimation from Gated Images Interact before Align: Leveraging Cross-Modal Knowledge for Domain Adaptive Action Recognition DN-DETR: Accelerate DETR Training by Introducing Query DeNoising The Probabilistic Normal Epipolar Constraint for Frame-To-Frame Rotation Optimization under Uncertain Feature Positions A Scalable Combinatorial Solver for Elastic Geometrically Consistent 3D Shape Matching Enhancing Classifier Conservativeness and Robustness by Polynomiality Raw High-Definition Radar for Multi-Task Learning Self-Supervised Image Representation Learning with Geometric Set Consistency Multi-View Transformer for 3D Visual Grounding Semiconductor Defect Detection by Hybrid Classical-Quantum Deep Learning Attention Reveals Occlusions Revisiting Domain Generalized Stereo Matching Networks from a Feature Consistency Perspective 
Chi-transformer: Towards Reliable Stereo From Cues NinjaDesc: Content-Concealing Visual Descriptors via Adversarial Learning SwapMix: Diagnosing and Regularizing the Over-reliance on Visual Context in Visual Question Answering Learning Part Segmentation through Unsupervised Domain Adaptation from Synthetic Vehicles CellTypeGraph: A New Geometric Computer Vision Benchmark Siamese Contrastive Embedding Network for Compositional Zero-Shot Learning Reference-based Video Super-Resolution Using Multi-Camera Video Triplets End-to-End Semi-Supervised Learning for Video Action Detection Parameter-free Online Test-time Adaptation 3D Shape Variational Autoencoder Latent Disentanglement via Mini-Batch Feature Swapping for Bodies and Faces Dual-Key Multimodal Backdoors for Visual Question Answering Can Neural Nets Learn the Same Model Twice? Investigating Reproducibility and Double Descent from the Decision Boundary Perspective RePaint: Inpainting using Denoising Diffusion Probabilistic Models Improving GAN Equilibrium by Raising Spatial Awareness Beyond Supervised vs. 
Unsupervised: Representative Benchmarking and Analysis of Image Representation Learning A variational Bayesian method for similarity learning in non-rigid image registration Task2Sim: Towards Effective Pre-training and Transfer from Synthetic Data Adaptive Trajectory Prediction via Transferable GNN Learning to Learn across Diverse Data Biases in Deep Face Recognition RIDDLE: Lidar Data Compression with Range Image Deep Delta Encoding Total Variation Optimization Layers for Computer Vision Transforming Model Prediction for Tracking Human Mesh Recovery from Multiple Shots FastDOG: Fast Discrete Optimization on GPU Estimating Example Difficulty using Variance of Gradients Closing the Generalization Gap of Cross-silo Federated Medical Image Segmentation Scale-Equivalent Distillation for Semi-Supervised Object Detection Long-term Visual Map Sparsification with Heterogeneous GNN ResSFL: A Resistance Transfer Framework for Defending Model Inversion Attack in Split Federated Learning Fast Point Transformer Sketch3T: Test-time Training for Zero-Shot SBIR Generative Flows with Invertible Attentions ABO: Dataset and Benchmarks for Real-World 3D Object Understanding A Dual Weighting Label Assignment Scheme for Object Detection ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts Explore the Spatio-temporal Aggregation for Insubstantial Object Detection: Benchmark Dataset and Baseline A Deeper Dive Into What Deep Spatiotemporal Networks Encode: Quantifying Static vs. 
Dynamic Information DGECN: A Depth-Guided Edge Convolutional Network For End-to-End 6D Pose Estimation BNUDC: A Two-Branched Deep Neural Network for Restoring Images from Under-Display Cameras Towards Fewer Annotations: Active Learning via Region Impurity and Prediction Uncertainty for Domain Adaptive Semantic Segmentation Hallucinated Neural Radiance Fields in the Wild The Devil is in the Margin: Margin-based Label Smoothing for Network Calibration Deep Depth from Focus with Differential Focus Volume Towards Layer-wise Image Vectorization Robust Federated Learning with Noisy and Heterogeneous Clients Retrieval-based Spatially Adaptive Normalization for Semantic Image Synthesis Dynamic Prototype Convolution Network for Few-Shot Semantic Segmentation Video Shadow Detection via Spatio-Temporal Interpolation Consistency Training It's All In the Teacher: Zero-Shot Quantization Brought Closer to the Teacher VISOLO: Grid-Based Space-Time Aggregation for Efficient Online Video Instance Segmentation Rethinking Spatial Invariance of Convolutional Networks for Object Counting Self-supervised Correlation Mining Network for Person Image Generation ISDNet: Integrating Shallow and Deep Networks for Efficient Ultra-high Resolution Segmentation Exploring Effective Data for Surrogate Training Towards Black-box Attack Contrastive Learning for Space-Time Correspondence via Self-cycle Consistency Accelerating Video Object Segmentation with Compressed Video Weakly Paired Associative Learning for Sound and Image Representations via Bimodal Associative Memory Incremental Cross-view Mutual Distillation for Self-supervised Medical CT Synthesis Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer Non-parametric Depth Distribution Modelling based Depth Inference for Multi-view Stereo LISA: Learning Implicit Shape and Appearance of Hands GIQE: Generic Image Quality Enhancement via N$^{th}$ Order Iterative Degradation Continual Learning for Visual Search 
with Backward Consistent Feature Embedding STCrowd: A Multimodal Dataset for Pedestrian Perception in Crowded Scenes Differentiable Stereopsis: Meshes from multiple views using differentiable rendering ST++: Make Self-training Work Better for Semi-supervised Semantic Segmentation Arbitrary-Scale Image Synthesis CRIS: CLIP-Driven Referring Image Segmentation ShapeFormer: Transformer-based Shape Completion via Sparse Representation Quantifying Societal Bias Amplification in Image Captioning Omni-DETR: Omni-Supervised Object Detection with Transformers XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding Cross-Architecture Self-supervised Video Representation Learning Feature Erasing and Diffusion Network for Occluded Person Re-Identification Styleformer: Transformer based Generative Adversarial Networks with Style Vector A Re-Balancing Strategy for Class-Imbalanced Classification Based on Instance Difficulty 360-Attack: Distortion-Aware Perturbations from Perspective-Views CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing NICE-SLAM: Neural Implicit Scalable Encoding for SLAM FIBA: Frequency-Injection based Backdoor Attack in Medical Image Analysis Learning Modal-Invariant and Temporal-Memory for Video-based Visible-Infrared Person Re-Identification Continual Predictive Learning from Videos BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning Learning to Zoom Inside Camera Imaging Pipeline TeachAugment: Data Augmentation Optimization Using Teacher Knowledge PhyIR: Physics-based Inverse Rendering for Panoramic Indoor Images Finding Good Configurations of Planar Primitives in Unorganized Point Clouds Towards Better Understanding Attribution Methods B-cos Networks: Alignment is All We Need for 
Interpretability TO-FLOW: Efficient Continuous Normalizing Flows with Temporal Optimization adjoint with Moving Speed Learning Invisible Markers for Hidden Codes in Offline-to-online Photography Learning Distinctive Margin toward Active Domain Adaptation Adiabatic Quantum Computing for Multi Object Tracking Learnable Lookup Table for Neural Network Quantization Artistic Style Discovery With Independent Components Occlusion-Aware Cost Constructor for Light Field Depth Estimation Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning Which Model to Transfer? Finding the Needle in the Growing Haystack Using 3D Topological Connectivity for Ghost Particle Reduction in Flow Reconstruction Neural Points: Point Cloud Representation with Neural Fields C$^2$AM: Contrastive learning of Class-agnostic Activation Map for Weakly Supervised Object Localization and Semantic Segmentation RCP: Recurrent Closest Point for Point Cloud Label, Verify, Correct: A Simple Few-Shot Object Detection Method Weakly-supervised Action Transition Learning for Stochastic Human Motion Prediction Dual-Generator Face Reenactment BoostMIS: Boosting Medical Image Semi-supervised Learning with Adaptive Pseudo Labeling and Informative Active Annotation InfoNeRF: Ray Entropy Minimization for Few-Shot Neural Volume Rendering Balanced Contrastive Learning for Long-Tailed Visual Recognition The Devil is in the Pose: Ambiguity-free 3D Rotation-invariant Learning via Pose-aware Convolution Partially Does It: Towards Scene-Level FG-SBIR with Partial Input Source-Free Object Detection by Learning to Overlook Domain Style Region-Aware Face Swapping COOPERNAUT: End-to-End Driving with Cooperative Perception for Networked Vehicles NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks SkinningNet: Two-Stream Graph Convolutional Neural Network for Skinning Prediction of Synthetic Characters Efficient Large-scale Localization by Global Instance Recognition 
All-photon Polarimetric Time-of-Flight Imaging Parametric Scattering Networks MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering Coarse-to-Fine Feature Mining for Video Semantic Segmentation Practical Stereo Matching via Cascaded Recurrent Network with Adaptive Correlation Robust Egocentric Photo-realistic Facial Expression Transfer for Virtual Reality Rethinking Visual Geo-localization for Large-Scale Applications Polymorphic-GAN: Generating Aligned Samples across Multiple Domains with Learned Morph Maps Balanced and Hierarchical Relation Learning for One-shot Object Detection High-Fidelity GAN Inversion for Image Attribute Editing Killing Two Birds with One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC I M Avatar: Implicit Morphable Head Avatars from Videos Proactive Image Manipulation Detection Text Spotting Transformers Learning a Structured Latent Space for Unsupervised Point Cloud Completion PCA-Based Knowledge Distillation Towards Lightweight and Content-Style Balanced Photorealistic Style Transfer Models Grounding Answers for Visual Questions Asked by Visually Impaired People Efficient Classification of Very Large Images with Tiny Objects Leveraging Adversarial Examples to Quantify Membership Information Leakage Towards Practical Deployment-Stage Backdoor Attack on Deep Neural Networks When to Prune? 
A Policy towards Early Structural Pruning Robust Optimization as Data Augmentation for Large-scale Graphs Sylph: A Hypernetwork Framework for Incremental Few-shot Object Detection Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis Harmony: A Generic Unsupervised Approach for Disentangling Semantic Content from Parameterized Transformations The Implicit Values of A Good Hand Shake: Handheld Multi-Frame Neural Depth Refinement Noise2NoiseFlow: Realistic Camera Noise Modeling without Clean Images MetaPose: Fast 3D Pose from Multiple Views without 3D Supervision Virtual Elastic Objects StyleSDF: High-Resolution 3D-Consistent Image and Geometry Generation Rethinking Architecture Design for Tackling Data Heterogeneity in Federated Learning Self-supervised Neural Articulated Shape and Appearance Models A Self-Supervised Descriptor for Image Copy Detection Rethinking Deep Face Restoration Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes Rethinking Controllable Variational Autoencoders Convolutions for Spatial Interaction Modeling Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization AdaFace: Quality Adaptive Margin for Face Recognition Towards End-to-End Unified Scene Text Detection and Layout Analysis Active Learning by Feature Mixing Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs Towards Better Plasticity-Stability Trade-off in Incremental Learning: A Simple Linear Connector Cloth-Changing Person Re-identification from A Single Image with Gait Prediction and Regularization SpaceEdit: Learning a Unified Editing Space for Open-Domain Image Editing Learning to Answer Questions in Dynamic Audio-Visual Scenarios Non-generative Generalized Zero-shot Learning via Task-correlated Disentanglement and Controllable Samples Synthesis Knowledge-Driven Self-Supervised Representation Learning for Facial Action Unit Recognition Coupling Vision and 
Proprioception for Navigation of Legged Robots URetinex-Net: Retinex-based Deep Unfolding Network for Low-light Image Enhancement Modeling Image Composition for Complex Scene Generation Think Twice Before Detecting GAN-generated Fake Images from their Spectral Domain Imprints Undoing the Damage of Label Shift for Cross-domain Semantic Segmentation Implicit Motion Handling for Video Camouflaged Object Detection Contrastive Conditional Neural Processes Exploring Set Similarity for Dense Self-supervised Representation Learning E2V-SDE: From Asynchronous Events to Fast and Continuous Video Reconstruction via Neural Stochastic Differential Equations Catching Both Gray and Black Swans: Open-set Supervised Anomaly Detection M5Product: Self-harmonized Contrastive Learning for E-commercial Multi-modal Pretraining CycleMix: A Holistic Strategy for Medical Image Segmentation from Scribble Supervision Mixed Multimodal Tokens for Vision Transformers Rethinking the Augmentation Module in Contrastive Learning: Learning Hierarchical Augmentation Invariance with Expanded Views AirObject: A Temporally Evolving Graph Embedding for Object Identification Balanced Multimodal Learning via On-the-fly Gradient Modulation Ray3D: ray-based 3D human pose estimation for monocular absolute 3D localization Computing Wasserstein-$p$ Distance Between Images with Linear Cost Capturing Humans in Motion: Temporal-Attentive 3D Human Pose and Shape Estimation from Monocular Video Feature Statistics Mixing Regularization for Generative Adversarial Networks Expressive Talking Head Generation with Granular Audio-Visual Control Geometric Anchor Correspondence Mining with Uncertainty Modelling for Universal Domain Adaptation OSSO: Obtaining Skeletal Shape from Outside How Do You Do It? 
Fine-Grained Action Understanding with Pseudo-Adverbs GIRAFFE HD: A High-Resolution 3D-aware Generative Model Continual Object Detection via Prototypical Task Correlation Guided Gating Mechanism Pixel screening based intermediate correction for blind deblurring LAS-AT: Adversarial Training with Learnable Attack Strategy Eigenlanes: Data-Driven Lane Descriptors for Structurally Diverse Lanes Moving Window Regression: A Novel Approach to Ordinal Regression SC^2-PCR: A Second Order Spatial Compatibility for Efficient and Robust Point Cloud Registration APRIL: Finding the Achilles' Heel on Privacy Leakage for Vision Transformers Eigencontours: Novel Contour Descriptors Based on Low-Rank Approximation Cross-modal Background Suppression for Audio-Visual Event Localization WebQA: Multihop and Multimodal QA Fairness-aware Adversarial Perturbation Towards Bias Mitigation for Deployed Deep Models Distribution-Aware Single-Stage Models for Multi-Person 3D Pose Estimation Active Learning for Open-set Annotation E2EC: An End-to-End Contour-based Method for High-Quality High-Speed Instance Segmentation Self-Supervised Arbitrary-Scale Point Clouds Upsampling via Implicit Neural Representation Relative Pose from a Calibrated and an Uncalibrated Smartphone Image Learning Optical Flow with Kernel Patch Attention Contrastive Learning for Unsupervised Video Highlight Detection ISNAS-DIP: Image-Specific Neural Architecture Search for Deep Image Prior MVSE: A Large-Scale Benchmark Dataset for Multi-Modal Videos Similarity Evaluation Discrete time convolution for fast event-based stereo Proper Reuse of Image Classification Features Improves Object Detection Object-Region Video Transformers Vision-Language Pre-Training for Boosting Scene Text Detectors Bandits for Structure Perturbation-based Black-box Attacks to Graph Neural Networks with Theoretical Guarantees Revisiting Large Kernel Design in Convolutional Neural Networks Generating High Fidelity Data from Low-density Regions using 
Diffusion Models Colar: Effective and Efficient Online Action Detection by Consulting Exemplars Learning Visual-Semantic Explanations of Deep Visual Latent Representations StyleMesh: Style Transfer for Indoor 3D Scene Reconstructions Probing Representation Forgetting in Supervised and Unsupervised Continual Learning Light Field Neural Rendering ROCA: Robust CAD Model Retrieval and Alignment from a Single Image Pix2NeRF: Unsupervised Conditional pi-GAN for Single Image to Neural Radiance Fields Translation Non-Iterative Recovery from Nonlinear Observations using Generative Models Forecasting from LiDAR via Future Object Detection Towards Total Recall in Industrial Anomaly Detection Low-Resource Adaptation for Personalized Co-Speech Gesture Generation Integrating Language Guidance into Vision-based Deep Metric Learning Non-isotropy Regularization for Proxy-based Deep Metric Learning Estimating Egocentric 3D Human Pose in the Wild with External Weak Supervision Less is More: Generating Grounded Navigation Instructions from Landmarks Automatic Synthesis of Diverse Weak Supervision Sources for Behavior Analysis Performance-Aware Mutual Knowledge Distillation for Improving Neural Architecture Search End-to-End Reconstruction-Classification Learning for Face Forgery Detection UKPGAN: A General Self-Supervised Keypoint Detector C2SLR: Consistency-enhanced Continuous Sign Language Recognition Boosting Black-Box Attack with Partially Transferred Conditional Adversarial Distribution Style Transformer for Image Inversion and Editing Uformer: A General U-Shaped Transformer for Image Restoration Speech Driven Tongue Animation DO-GAN: A Double Oracle Framework for Generative Adversarial Networks IntentVizor: Towards Generic Query Guided Interactive Video Summarization Self-supervised Deep Image Restoration via Adaptive Stochastic Gradient Langevin Dynamics Sound-Guided Semantic Image Manipulation Adaptive Gating for Single-Photon 3D Imaging Target-aware Dual Adversarial Learning 
and a Multi-scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection GaTector: A Unified Framework for Gaze Object Prediction Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation Anomaly Detection via Reverse Distillation from One-Class Embedding Dynamic 3D Gaze from Afar: Deep Gaze Estimation from Temporal Eye-Head-Body Coordination Maximum Consensus by Weighted Influences of Monotone Boolean Functions Beyond Fixation: Dynamic Window Visual Transformer Dressing in the Wild by Watching Dance Videos Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers Contrastive Boundary Learning for Point Cloud Segmentation Proto2Proto: Can you recognize the car, the way I do? Bridged Transformer for Vision and Point Cloud 3D Object Detection V2C: Visual Voice Cloning An Efficient Training Approach for Very Large Scale Face Recognition SemanticStyleGAN: Learning Compositional Generative Priors for Controllable Image Synthesis and Editing SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition Task Discrepancy Maximization for Fine-grained Few-Shot Classification Reflection and Rotation Symmetry Detection via Equivariant Learning Self-Supervised Equivariant Learning for Oriented Keypoint Detection Improving the Transferability of Targeted Adversarial Examples through Object-Based Diverse Input 3DeformRS: Certifying Spatial Deformations on Point Clouds DiGS: Divergence guided shape implicit neural representation for unoriented point clouds UNICON: Combating Label Noise Through Uniform Selection and Contrastive Learning Vision Transformer with Deformable Attention Diverse Plausible 360-Degree Image Outpainting for Efficient 3DCG Background Creation Industrial Style Transfer with Large-scale Geometric Warping and Content Preservation Hierarchical Modular Network for Video Captioning Optimal LED Spectral Multiplexing 
for NIR2RGB Translation Exploring Frequency Adversarial Attacks for Face Forgery Detection LAR-SR: A Local Autoregressive Model for Image Super Resolution What do navigation agents learn about their environment? HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation Entropy-based Active Learning for Object Detection with Progressive Diversity Constraint Class Similarity Weighted Knowledge Distillation for Continual Semantic Segmentation Swin Transformer V2: Scaling Up Capacity and Resolution Knowledge Distillation via the Target-aware Transformer Sparse Object-level Supervision for Instance Segmentation with Pixel Embeddings Open-Domain, Content-based, Multi-modal Fact-checking of Out-of-Context Images via Online Resources Exemplar-based Pattern Synthesis with Implicit Periodic Field Network RigidFlow: Self-Supervised Scene Flow Learning on Point Clouds by Local Rigidity Prior Weakly Supervised Segmentation on Outdoor 4D point clouds with Temporal Matching and Spatial Graph Propagation E^2(GO)MOTION: Motion Augmented Event Stream for Egocentric Action Recognition Ego4D: Around the World in 3,000 Hours of Egocentric Video Spiking Transformers for Event-based Single Object Tracking Few-Shot Incremental Learning for Label-to-Image Translation CD^2-pFed: Cyclic Distillation-guided Channel Decoupling for Model Personalization in Federated Learning OoD-Bench: Quantifying and Understanding Two Dimensions of Out-of-Distribution Generalization Speed up Object Detection on Gigapixel-level Image with Patch Arrangement Learning Adaptive Warping for Real-World Rolling Shutter Correction Robust and Accurate Superquadric Recovery: a Probabilistic Approach SimVP: Simpler yet Better Video Prediction Hyperspherical Consistency Regularization Dense Depth Priors for Neural Radiance Fields from Sparse Input Views HyperInverter: Improving StyleGAN Inversion via Hypernetwork Target-Relevant Knowledge Preservation for Multi-Source Domain Adaptive Object Detection 
Whose Hands are These? Hand Detection and Hand-Body Association in the Wild Blind Face Restoration via Integrating Face Shape and Generative Priors Multimodal Material Segmentation Do explanation methods explain? Model knows best Deep Hybrid Models for Out-of-Distribution Detection Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetics Detecting Camouflaged Object in Frequency Domain Exploring Structure-aware Transformer over Interaction Proposals for Human-Object Interaction Detection Appearance and Structure Aware Robust Deep Visual Graph Matching: Attack, Defense and Beyond PhoCaL: A Multi-Modal Dataset for Category-Level Object Pose Estimation with Photometrically Challenging Objects HINT: Hierarchical Neuron Concept Explainer Vox2Cortex: Fast Explicit Reconstruction of Cortical Surfaces from 3D MRI Scans with Geometric Deep Neural Networks Generative Cooperative Learning for Unsupervised Video Anomaly Detection Panoptic, Instance and Semantic Relations: A Relational Context Encoder to Enhance Panoptic Segmentation Object-Relation Reasoning Graph for Action Recognition Lifelong Graph Learning A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation Arch-Graph: Acyclic Architecture Relation Predictor for Task-Transferable Neural Architecture Search Rethinking Minimal Sufficient Representation in Contrastive Learning Physical Simulation Layer for Accurate 3D Modeling Image Animation with Perturbed Masks Sparse to Dense Dynamic 3D Facial Expression Generation AIM: an Auto-Augmenter for Images and Meshes PlanarRecon: Real-time 3D Plane Detection and Reconstruction from Posed Monocular Videos Modular Action Concept Grounding in Semantic Video Prediction Generating Representative Samples for Few-Shot Classification SurfEmb: Dense and Continuous Correspondence Distributions for Object Pose Estimation with Learnt Surface Embeddings Sequential Voting with Relational Box Fields for Active Object Detection Are Multimodal Transformers 
Robust to Missing Modality? Debiased Learning from Naturally Imbalanced Pseudo-Labels Weakly-Supervised Online Action Segmentation in Multi-View Instructional Videos Learning to deblur using light field generated and real defocus images TOAD: Topologically-Aware Deformation Fields for Single-view 3D Reconstruction An Empirical Study of Training End-to-End Vision-and-Language Transformers PLAD: Learning to Infer Shape Programs with Pseudo-Labels and Approximate Distributions The Neurally-Guided Shape Parser: Grammar-based Labeling of 3D Shape Regions with Approximate Inference Imposing Consistency for Optical Flow Estimation Generating Diverse 3D Reconstructions from a Single Occluded Face Image RecDis-SNN: Rectifying Membrane Potential Distribution for Directly Training Spiking Neural Networks 3D Moments from Near-Duplicate Photos CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation MatteFormer: Transformer-Based Image Matting via Prior-Tokens Deformable ProtoPNet: An Interpretable Image Classifier Using Deformable Prototypes Learning Bayesian Sparse Networks with Full Experience Replay for Continual Learning Category-Aware Transformer Network for Better Human-Object Interaction Detection Segment, Magnify and Reiterate: Detecting Camouflaged Objects the Hard Way UNIST: Unpaired Neural Implicit Shape-to-Shape Translation REGTR: End-to-end Point Cloud Correspondences with Transformers Show, Deconfound and Tell: Image Captioning with Causal Inference DeepFake Disrupter: The Detector of DeepFake Is My Friend Lite Vision Transformer with Enhanced Self-Attention Bi-directional Object-context Prioritization Learning for Saliency Ranking OSKDet: Orientation-sensitive Keypoint Localization for Rotated Object Detection Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification Invariant Grounding for Video Question Answering Fine-tuning Global Model via Data-Free Knowledge Distillation for Non-IID Federated Learning Learning Robust 
Image-Based Rendering on Sparse Scene Geometry via Depth Completion FENeRF: Face Editing in Neural Radiance Fields A Probabilistic Graphical Model Based on Neural-symbolic Reasoning for Visual Relationship Detection CVNet: Contour Vibration Network for Building Extraction What to Look at and Where: Semantic and Spatial Refined Transformer for Detecting Human-Object Interactions Nested Hyperbolic Spaces for Dimensionality Reduction and Hyperbolic NN Design ABPN: Adaptive Blend Pyramid Network for Real-Time Local Retouching of Ultra High-Resolution Photo Does Robustness on ImageNet Transfer to Downstream Tasks? Crowd Counting in the Frequency Domain SimMIM: A Simple Framework for Masked Image Modeling GrainSpace: A Large-scale Dataset for Fine-grained and Domain-adaptive Recognition of Cereal Grains End-to-End Trajectory Distribution Prediction Based on Occupancy Grid Maps MPViT: Multi-Path Vision Transformer for Dense Prediction Remember the Difference: Cross-Domain Few-Shot Semantic Segmentation via Meta-Memory Transfer ARCS: Accurate Rotation and Correspondence Search Ranking Distance Calibration for Cross-Domain Few-Shot Learning MetaFSCIL: A Meta-Learning Approach for Few-Shot Class Incremental Learning Fisher Information Guidance for Learned Time-of-Flight Imaging Joint Video Summarization and Moment Localization by Cross-Task Sample Transfer MotionAug: Augmentation with Physical Correction for Human Motion Prediction Deep Color Consistent Network for Low-Light Image Enhancement Non-Probability Sampling Network for Stochastic Human Trajectory Prediction GCFSR: a Generative and Controllable Face Super Resolution Method Without Facial and GAN Priors Improving Adversarial Transferability via Neuron Attribution-Based Attacks HiVT: Hierarchical Vector Transformer for Multi-Agent Motion Prediction Pooling Revisited: Your Receptive Field is Sub-optimal Compressing Models with Few Samples: Mimicking then Replacing Shape from Thermal Radiation: Passive Ranging Using 
Multi-spectral LWIR Measurements Layered Depth Refinement with Mask Guidance Highly-efficient Incomplete Large-scale Multi-view Clustering with Consensus Bipartite Graph Scaling Up Vision-Language Pretraining for Image Captioning Optimal Correction Cost for Object Detection Evaluation Deformable Video Transformer High-fidelity Monocular Human Reconstruction by Combining Implicit and Explicit Representations Nonlocal Sparse CRF Long-Short Temporal Contrastive Learning of Video Transformers QS-Attn: Query-Selected Attention for Contrastive Learning in I2I Translation All-In-One Image Restoration for Unknown Corruption Learning to Detect Scene Landmarks for Camera Localization WildNet: Learning Domain Generalized Semantic Segmentation from the Wild Pushing the Envelope of Gradient Boosting Forests via Globally-Optimized Oblique Trees Egocentric Scene Understanding via Multimodal Spatial Rectifier OSSGAN: Open-Set Semi-Supervised Image Generation Large-scale Video Panoptic Segmentation in the Wild: A Benchmark Unsupervised Representation Learning for Binary Networks by Joint Classifier Learning β-DARTS: Beta-Decay Regularization for Differentiable Architecture Search Stereo Depth from Events Cameras: Concentrate and Focus on the Future Transferable Sparse Adversarial Attack FAM: Visual Explanations for the Feature Representations from Deep Convolutional Networks Noise-Aware NeRFs for Burst-Denoising Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds Bayesian Invariant Risk Minimization Extracting Triangular 3D Models, Materials, and Lighting From Images RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition Transformer-empowered Multi-scale Contextual Matching and Aggregation for Multi-contrast MRI Super-resolution SphericGAN: Semi-supervised Hyper-spherical Generative Adversarial Networks for Fine-grained Image Synthesis LD-ConGR: A Large RGB-D Video Dataset for Long-Distance Continuous Gesture Recognition 
Unifying Panoptic Segmentation for Autonomous Driving VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning Interspace Pruning: Using Adaptive Filter Representations to Improve Training of Sparse CNNs NightLab: A Dual-level Architecture with Hardness Detection for Segmentation at Night Learning to Memorize Feature Hallucination for One-Shot Image Generation FedCorr: Multi-Stage Federated Learning for Label Noise Correction GeoNeRF: Generalizing NeRF with Geometry Priors Neural 3D Video Synthesis TransforMatcher: Match-to-Match Attention for Semantic Correspondence Represent, Compare, and Learn: A Similarity-Aware Framework for Class-Agnostic Counting AxIoU: An Axiomatically Justified Measure for Video Moment Retrieval Deep Safe Multi-view Clustering: Reducing the Risk of Clustering Performance Degradation Caused by View Increase. Burst Image Restoration and Enhancement Modeling Indirect Illumination for Inverse Rendering Knowledge Mining with Scene Text for Fine-Grained Recognition FlexIT: Towards Flexible Semantic Image Translation Surpassing the Human Accuracy: Detecting Gallbladder Cancer from USG Images with Curriculum Learning More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech Mimicking the Oracle: An Initial Phase Decorrelation Approach for Class Incremental Learning Multi-Person Extreme Motion Prediction Does text attract attention on e-commerce images: A novel saliency prediction dataset and method Instance-Aware Dynamic Neural Network Quantization Energy-based Latent Aligner for Incremental Learning Semi-supervised Video Paragraph Grounding with Contrastive Encoder Personalized Image Aesthetics Assessment with Rich Attributes Attention Concatenation Volume for Accurate and Efficient Stereo Matching Split Hierarchical Variational Compression MS2DG-Net: Progressive Correspondence Learning via Multi Sparse Semantic Dynamic Graph Large Loss Matters in Weakly Supervised Multi-Label Classification Recurring 
the Transformer for Video Action Recognition Look Closer to Supervise Better: One-Shot Font Generation via Component-Based Discriminator KG-SP: Knowledge Guided Simple Primitives for Open World Compositional Zero-Shot Learning Hyperbolic Vision Transformers: Combining Improvements in Metric Learning Camera Pose Estimation using Implicit Distortion Models A Structured Dictionary Perspective on Implicit Neural Representations ST-MFNet: A Spatio-Temporal Multi-Flow Network for Frame Interpolation Geometric Structure Preserving Warp for Natural Image Stitching Slimmable Domain Adaptation Meta Convolutional Neural Networks for Single Domain Generalization Label Matching Semi-Supervised Object Detection Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning Abandoning the Bayer-Filter to See in the Dark Deep Hierarchical Semantic Segmentation MixFormer: End-to-End Tracking with Iterative Mixed Attention ContIG: Self-supervised Multimodal Contrastive Learning for Medical Imaging with Genetics Occlusion-robust Face Alignment using A Viewpoint-invariant Hierarchical Network Architecture Segment-Fusion: Hierarchical Context Fusion for Robust 3D Semantic Segmentation STRPM: A Spatiotemporal Residual Predictive Model for High-Resolution Video Prediction Boosting 3D Object Detection by Simulating Multimodality on Point Clouds RADU: Ray-Aligned Depth Update Convolutions for ToF Data Denoising Auto-Encoder is All You Need Whose Track Is It Anyway? 
Improving Robustness to Tracking Errors with Affinity-Based Prediction Multi-marginal Contrastive Learning for Multi-label Subcellular Protein Localization Stand-Alone Inter-Frame Attention in Video Models Hyperbolic Image Segmentation RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality Time3D: End-to-End Joint Monocular 3D Object Detection and Tracking for Autonomous Driving SWEM: Towards Real-Time Video Object Segmentation with Sequential Weighted Expectation-Maximization ART-Point: Improving Rotation Robustness of Point Cloud Classifiers via Adversarial Rotation Super-Fibonacci Spirals: Fast, Low-Discrepancy Sampling of SO(3) Learning to Learn and Remember Super Long Multi-Domain Task Sequence Noise Is Also Useful: Negative Correlation-Steered Latent Contrastive Learning FLOAT: Factorized Learning of Object Attributes for Improved Multi-object Multi-part Scene Parsing Surface-Aligned Neural Radiance Fields for Controllable 3D Human Synthesis Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model Real World Self-Supervised Multi-Image Super-Resolution for Multi-Exposure Push-Frame Satellites Knowledge Distillation with the Reused Teacher Classifier Geometry-Aware Guided Loss for Deep Crack Recognition AdaMixer: A Simple and Accurate Query-based Object Detector Learning Structured Gaussians to Approximate Deep Ensembles Input-level Inductive Biases for 3D Reconstruction BTS: A Bi-lingual Benchmark for Text Segmentation in the Wild Stereo Magnification with Multi-Layer Images Segment and Complete: Defending Object Detectors against Adversarial Patch Attacks with Robust Patch Detection Coherent Point Drift Revisited for Non-rigid Shape Matching and Registration Alleviating Semantics Distortion in Unsupervised Low-Level Image-to-Image Translation via Structure Consistency Constraint CNN Filter DB: An Empirical Investigation of Trained Convolutional Filters Text2Mesh: Text-Driven Neural Stylization for Meshes RFNet: 
Unsupervised Network for Mutually Reinforcing Multi-modal Image Registration and Fusion Image Dehazing Transformer with Transmission-Aware 3D Position Embedding Label Relation Graphs Enhanced Hierarchical Residual Network for Hierarchical Multi-Granularity Classification RGB-Multispectral Matching: Dataset, Learning Methodology, Evaluation Maintaining Reasoning Consistency in Compositional Visual Question Answering PolyWorld: Polygonal Building Extraction with Graph Neural Networks in Satellite Images Fast Algorithm for Low-rank Tensor Completion in Delay-embedded Space Dynamic Sparse R-CNN Improving Robustness Against Stealthy Weight Bit-Flip Attacks by Output Code Matching NPBG++: Accelerating Neural Point-Based Graphics Forward Compatible Few-Shot Class-Incremental Learning Weakly-supervised Metric Learning with Cross-Module Communications for the Classification of Anterior Chamber Angle Images Learning Canonical F-Correlation Projection for Compact Multiview Representation Learning Non-target Knowledge for Few-shot Semantic Segmentation Towards Low-Cost and Efficient Malaria Detection PoseTrack21: A Dataset for Person Search, Multi-Object Tracking and Multi-Person Pose Tracking NeuralHDHair: Automatic High-fidelity Hair Modeling from a Single Image Using Implicit Neural Representations ClusterGNN: Cluster-based Coarse-to-fine Graph Neural Network for Efficient Feature Matching An Iterative Quantum Approach for Transformation Estimation from Point Sets ATPFL: Automatic Trajectory Prediction Model Design under Federated Learning Framework Understanding and Increasing Efficiency of Frank-Wolfe Adversarial Training Targeted Supervised Contrastive Learning for Long-Tailed Recognition Optimizing Elimination Templates by Greedy Parameter Search M3T: three-dimensional Medical image classifier using Multi-plane and Multi-slice Transformer Projective Manifold Gradient Layer for Deep Rotation Regression PUMP: Pyramidal and Uniqueness Matching Priors for Unsupervised 
Learning of Local Descriptors Deep orientation-aware functional maps: Tackling symmetry issues in Shape Matching A Versatile Multi-View Framework for LiDAR-based 3D Object Detection with Guidance from Panoptic Segmentation Lite-MDETR: A Lightweight Multi-Modal Detector Cross Modal Retrieval with Querybank Normalisation On Learning Contrastive Representations for Learning with Noisy Labels Cross-view transformers for real-time map-view semantic segmentation Towards Data-Free Model Stealing in a Hard Label Setting The DEVIL is in the Details: A Diagnostic Evaluation Benchmark for Video Inpainting Unseen Classes at a Later Time? No Problem Channel Balancing for Accurate Quantization of Winograd Convolutions Instance masks are what you need: Segmentation parity from object boundaries TVConv: Efficient Translation Variant Convolution for Layout-aware Visual Processing Scanline Homographies for Rolling-Shutter Plane Absolute Pose Dual-Shutter Optical Vibration Sensing DoubleField: Bridging the Neural Surface and Radiance Fields for High-fidelity Human Reconstruction and Rendering Robust Structured Declarative Classifiers for 3D Point Clouds: Defending Adversarial Attacks with Implicit Gradients TubeR: Tubelet Transformer for Video Action Detection Data-Free Network Compression via Parametric Non-uniform Mixed Precision Quantization Contour-Hugging Heatmaps for Landmark Detection Local Attention Pyramid for Scene Image Generation Implicit Feature Decoupling with Depthwise Quantization InsetGAN for Full-Body Image Generation Recurrent Variational Network: A Deep Learning Inverse Problem Solver applied to the task of Accelerated MRI Reconstruction Robust Invertible Image Steganography Disentangling visual and written concepts in CLIP Causal CLIP Fine-tuning for Fashion Product Retrieval Accelerating Neural Network Optimization Through an Automated Control Theory Lens Comprehending and Ordering Semantics for Image Captioning Grounded Language-Image Pre-training Hierarchical 
Self-supervised Representation Learning for Movie Understanding RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition Consistency driven Sequential Transformers Attention Model for Partially Observable Scenes Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better than Dot-Product Self-Attention How Well Do Sparse ImageNet Models Transfer? Towards Principled Disentanglement for Domain Generalization Task-Adaptive Negative Class Envision for Few-Shot Open-Set Recognition Path-CNN: Topology-Aware Centerline Segmentation Using Sparse Annotation Image Based Reconstruction of Liquids from 2D Surface Detections Neural Convolutional Surfaces Graph-context Attention Networks for Size-varied Deep Graph Matching Learning to Solve Hard Minimal Problems Neural Mesh Simplification SPAct: Self-supervised Privacy Preservation for Action Recognition Towards Language-free Training for Text-to-Image Generation Rep-Net: Efficient On-Device Learning via Feature Reprogramming 3D-VField: Learning to Adversarially Deform Point Clouds for Robust 3D Object Detection TrackFormer: Multi-Object Tracking with Transformers Deep 3D-to-2D Watermarking: Embedding Messages in 3D Meshes and Extracting Them from 2D Renderings A Comprehensive Study of Image Classification Model Sensitivity to Foregrounds, Backgrounds, and Visual Attributes EnvEdit: Environment Editing for Vision-and-Language Navigation DeepFace-EMD: Re-ranking using Patch-wise Earth Mover's Distance Improves Out-of-Distribution Face Identification Mega-NERF: Scalable Construction of Large-Scale NeRFs for Virtual Fly-Throughs MulT: An End-to-End Multitask Learning Transformer Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields Leveraging Real Talking Faces via Self-Supervision for Robust Forgery Detection Use All The Labels: A Hierarchical Multi-Label Contrastive Learning Framework Plenoxels: 
Radiance Fields without Neural Networks Pushing the Limits of Simple Pipelines for Practical Few-Shot Learning PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning CO-SNE: Dimensionality Reduction and Visualization for Hyperbolic Data EASE: Unsupervised Discriminant Subspace Learning for Transductive Few-Shot Learning 3D Photo Stylization: Learning to Generate Stylized Novel Views from a Single Image SIMBAR: Single Image-Based Scene Relighting For Effective Data Augmentation For Automated Driving Vision Tasks VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks VALHALLA: Visual Hallucination for Machine Translation Learning Pairwise Affinity for Open-World Instance Segmentation CAD: Co-Adapting Discriminative Features for Improved Few-Shot Classification Investigating the Impact of Multi-LiDAR Placement on Object Detection for Autonomous Driving Hypergraph-Induced Semantic Tuplet Loss for Deep Metric Learning Generalized Category Discovery Deep Image-based Illumination Harmonization Mixed Differential Privacy in Computer Vision MUSE-VAE: Multi-Scale VAE for Environment-Aware Long Term Trajectory Prediction UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog Weakly Supervised Rotation-Invariant Aerial Object Detection Network Evaluation-oriented Knowledge Distillation for Deep Face Recognition Robust Cross-Modal Representation Learning with Progressive Self-Distillation Transformer Tracking with Cyclic Shifting Window Attention LTP: Lane-based Trajectory Prediction for Autonomous Driving Generating 3D Bio-Printable Patches Using Wound Segmentation and Reconstruction to Treat Diabetic Foot Ulcers Multi-instance Point Cloud Registration by Efficient Correspondence Clustering AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition AutoLoss-GMS: Searching Generalized Margin-based Softmax Loss Function for Person Re-identification Convolution of Convolution: 
Let Kernels Spatially Collaborate DiffPoseNet: Direct Differentiable Camera Pose Estimation Modeling sRGB Camera Noise with Normalizing Flows Semantic-shape Adaptive Feature Modulation for Semantic Image Synthesis Federated Learning with Position-Aware Neurons Symmetry and Uncertainty-Aware Object SLAM for 6DoF Object Pose Estimation Point Density-Aware Voxels for LiDAR 3D Object Detection A Conservative Approach for Unbiased Learning on Unknown Biases The Majority Can Help the Minority: Context-rich Minority Oversampling for Long-tailed Classification Symmetry-aware Neural Architecture for Embodied Visual Exploration DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers Egocentric Prediction of Action Target in 3D What makes transfer learning work for medical images: feature reuse & other factors Alignment-Uniformity aware Representation Learning for Zero-shot Video Classification Unsupervised Learning of De-biased Representation with Pseudo-bias Attribute DECORE: Deep Compression with Reinforcement Learning RGB-Depth Fusion GAN for Indoor Depth Completion MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound Class-Aware Contrastive Semi-Supervised Learning Learning to Prompt for Continual Learning DEFEAT: Deep Hidden Feature Backdoor Attacks by Imperceptible Perturbation and Latent Representation Constraints Self-Supervised Dense Consistency Regularization for Image-to-Image Translation Forward Compatible Training for Large-Scale Embedding Retrieval Systems Joint Forecasting of Panoptic Segmentations with Difference Attention Revisiting the Transferability of Supervised Pretraining: an MLP Perspective Disentangling Visual Embeddings for Attributes and Objects SeeThroughNet: Resurrection of Auxiliary Loss by Preserving Class Probability Information Neural Reflectance for Shape Recovery with Shadow Handling Topology-Preserving Shape Reconstruction and Registration via Neural Diffeomorphic Flow XYDeblur: Divide and 
Conquer for Single Image Deblurring ScePT: Scene-consistent, Policy-based Trajectory Predictions for Planning Visual Acoustic Matching Fair Contrastive Learning for Facial Attribute Classification Neural Prior for Trajectory Estimation AutoMine: An Unmanned Mine Dataset SMARTADAPT: Multi-branch Object Detection Framework for Videos on Mobiles Neural Face Identification in a 2D Wireframe Projection of a Manifold Object AlignMixup: Improving Representations By Interpolating Aligned Features Memory-Augmented Non-Local Attention for Video Super-Resolution ESCNet: Gaze Target Detection with the Understanding of 3D Scenes AdaptPose: Cross-Dataset Adaptation for 3D Human Pose Estimation by Learnable Motion Generation Distinguishing Unseen from Seen for Generalized Zero-shot Learning When Does Contrastive Visual Representation Learning Work? Privacy-preserving Online AutoML for Domain-Specific Face Detection Robust outlier detection by de-biasing VAE likelihoods GridShift: A Faster Mode-seeking Algorithm for Image Segmentation and Object Tracking Continual Learning with Lifelong Vision Transformer M2I: From Factored Marginal Trajectory Prediction to Interactive Prediction Stochastic Variance Reduced Ensemble Adversarial Attack for Boosting the Adversarial Transferability Representing 3D Shapes with Probabilistic Directed Distance Fields Restormer: Efficient Transformer for High-Resolution Image Restoration Learning with Twin Noisy Labels for Visible-Infrared Person Re-Identification Few-shot Learning with Noisy Labels Co-Domain Symmetry for Complex-Valued Deep Learning Pyramid Architecture for Multi-Scale Processing in Point Cloud Segmentation GCR: Gradient Coreset based Replay Buffer Selection for Continual Learning Domain Adaptation on Point Clouds via Geometry-Aware Implicits Ranking-Based Siamese Visual Tracking Coarse-to-Fine Disentangling Transformer for Human-Object Interaction Detection MDAN: Multi-level Dependent Attention Network for Visual Emotion Analysis 
AdaSTE: An Adaptive Straight-Through Estimator to Train Binary Neural Networks DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation DTA: Physical Camouflage Attacks using Differentiable Transformation Network Layer-wised Model Aggregation for Personalized Federated Learning Video Swin Transformer Online Continual Learning on a Contaminated Data Stream with Blurry Task Boundaries General Incremental Learning with Domain-aware Categorical Representations Crafting Better Contrastive Views for Siamese Representation Learning A Style-aware Discriminator for Controllable Image Translation BoosterNet: Improving Domain Generalization of Deep Neural Nets using Culpability-Ranked Features A Unified Framework for Implicit Sinkhorn Differentiation Brain-Supervised Image Editing Neural Shape Mating: Self-Supervised Object Assembly with Adversarial Shape Priors Multimodal Colored Point Cloud to Image Alignment Graph-based Spatial Transformer with Memory Replay for Multi-future Pedestrian Trajectory Prediction Multi-Objective Diverse Human Motion Prediction with Knowledge Distillation Two Coupled Rejection Metrics Can Tell Adversarial Examples Apart Autoregressive Image Generation using Residual Quantization SGTR: End-to-end Scene Graph Generation with Transformer Protecting Facial Privacy: Generating Adversarial Identity Masks via Style-robust Makeup Transfer PPDL: Predicate Probability Distribution based Loss for Unbiased Scene Graph Generation Localized Adversarial Domain Generalization Patch-level Representation Learning for Self-supervised Vision Transformers KNN Local Attention for Image Restoration Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation PILC: Practical Image Lossless Compression with an End-to-end GPU Oriented Neural Framework DAD-3DHeads: A Large-scale Dense, Accurate and Diverse Dataset for 3D Dense Head Alignment from a Single Image Is Mapping Necessary for Realistic PointGoal Navigation? 
Cross-Domain Correlation Distillation for Unsupervised Domain Adaptation in Nighttime Semantic Segmentation LiT: Zero-Shot Transfer with Locked-image text Tuning Scaling Vision Transformers Spatial Commonsense Graph for Object Localisation in Partial Scenes Trajectory Optimization for Physics-Based Reconstruction of 3d Human Pose from Monocular Video 3MASSIV: Multilingual, Multimodal and Multi-Aspect dataset of Social Media Short Videos Upright-Net: Learning Upright Orientation for 3D Point Cloud D*-V2X: A Large-Scale Dataset for Vehicle-Infrastructure Cooperative 3D Object Detection Differentiable Dynamics for Articulated 3d Human Motion Reconstruction Clean Implicit 3D Structure from Noisy 2D STEM Images MPC: Multi-view Probabilistic Clustering Node-aligned Graph Convolutional Network for Whole-slide Image Representation and Classification Multidimensional Belief Quantification for Label-Efficient Meta-Learning Bayesian Nonparametric Submodular Video Partition for Robust Anomaly Detection Uni6D: A Unified CNN Framework without Projection Breakdown in 6D Pose Estimation Exploring Patch-wise Semantic Relation for Contrastive Learning in Image-to-Image Translation Tasks Enabling Equivariance for Arbitrary Lie Groups Multi-Scale Memory-Based Video Deblurring Privacy Preserving Partial Localization Towards Robust and Reproducible Active Learning using Neural Networks Marginal Contrastive Correspondence for Exemplar-based Image Translation TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repeated Action Counting Training High-Performance Low-Latency Spiking Neural Networks by Differentiation on Spike Representation FaceFormer: Speech-Driven 3D Facial Animation with Transformers LARGE: Latent-Based Regression Through GAN Semantics TransVPR: Transformer-Based Place Recognition with Multi-Level Attention Aggregation AR-NeRF: Unsupervised Learning of Depth and Defocus Effects from Natural Images with Aperture Rendering Neural Radiance Fields 
CAT-Det: Contrastively Augmented Transformer for Multi-modal 3D Object Detection SASIC: Stereo Image Compression with Latent Shifts and Stereo Attention Controllable Animation of Fluid Elements in Still Images Revisiting BatchNorm's Learnable Affines in Few-Shot Transfer Learning Learning Graph Regularisation for Guided Super-Resolution Topology Preserving Local Road Network Estimation from Single Onboard Camera Image Video-Text Representation Learning via Differentiable Weak Temporal Alignment BppAttack: Stealthy and Efficient Trojan Attacks against Deep Neural Networks via Image Quantization and Contrastive Adversarial Learning Face2Exp: Combating Data Biases for Facial Expression Recognition Leveraging Equivariant Features for Absolute Pose Regression Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut Multi-View Depth Estimation by Fusing Single-View Depth Probability with Multi-View Geometry ZZ-Net: A Universal Rotation Equivariant Architecture for 2D Point Clouds Interactive Disentanglement: Learning Concepts by Interacting with their Prototype Representations Incremental Learning in Semantic Segmentation from Image Labels Complex Backdoor Detection by Symmetric Feature Differencing Constrained Few-shot Class-incremental Learning HyperSegNAS: Bridging One-Shot Neural Architecture Search with 3D Medical Image Segmentation using HyperNet Amodal Panoptic Segmentation Not Just Selection, but Exploration: Online Class-Incremental Continual Learning via Dual View Consistency Coarse-to-Fine Q-attention: Efficient Learning for Visual Robotic Manipulation via Discretisation Learning ABCs: Approximate Bijective Correspondence for isolating factors of variation Pin the Memory: Learning to Generalize Semantic Segmentation Long-tailed Visual Recognition via Gaussian Clouded Logit Adjustment Knowledge distillation: A good teacher is patient and consistent Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language 
Searching the Deployable Convolution Neural Networks for GPUs MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing Condensing CNNs with Partial Differential Equations Adaptive Early-Learning Correction for Segmentation from Noisy Annotations Bounded Adversarial Attack on Deep Content Features Towards Driving-Oriented Metric for Lane Detection Models Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness Better Trigger Inversion Optimization in Backdoor Scanning Leveling Down in Computer Vision: Pareto Inefficiencies in Fair Deep Classifiers Towards Understanding and Simplifying MoCo: Dual Temperature Helps Contrastive Learning without Many Negative Samples Smooth Maximum Unit: Smooth Activation Function for Deep Networks using Smoothing Maximum Technique Text-to-Image Synthesis based on Object-Guided Joint-Decoding Transformer Image Segmentation Using Text and Image Prompts Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation Vision-Language Pre-Training with Triple Contrastive Learning Temporal Context Matters: Enhancing Single Image Prediction with Disease Progression Representations Globetrotter: Connecting Languages by Connecting Images Single-Stage 3D Geometry-Preserving Depth Estimation Model Training on Dataset Mixtures with Uncalibrated Stereo Data It’s Time for Artistic Correspondence in Music and Video Equivariant Point Set Analysis via Learning Orientations for Message Passing KeyTr: Keypoint Transporter for 3D Reconstruction of Deformable Objects in Videos P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision GlideNet: Global, Local and Intrinsic based Dense Embedding NETwork for Multi-category Attributes Prediction MatchFAME: Fast, Accurate and Memory-Efficient Multi-Object Matching Neural Emotion Director: Speech-preserving semantic control of facial expressions in “in-the-wild” videos Id-Free Person Similarity Learning Alleviating Emotional bias in 
Affective Image Captioning by Contrastive Data Collection A study on the distribution of social biases in self-supervised learning visual models Motron: Multimodal Probabilistic Human Motion Forecasting Gaussian Process Modeling of Approximate Inference Errors for Variational Autoencoders Real-time hyperspectral imaging in hardware via trained metasurface encoders SmartPortraits: Depth Powered Handheld Smartphone Dataset of Human Portraits for State Estimation, Reconstruction and Synthesis Improving Segmentation of the Inferior Alveolar Nerve through Deep Label Propagation SLIC: Self-Supervised Learning with Iterative Clustering for Human Action Videos Self-supervised Spatial Reasoning on Multi-View Line Drawings Contrastive Test-Time Adaptation Why Discard if You can Recycle?: A Recycling Max Pooling Module for 3D Point Cloud Analysis Do learned representations respect causal relationships? Zero-Query Transfer Attacks on Context-Aware Object Detectors Training Quantised Neural Networks with STE Variants: the Additive Noise Annealing Algorithm Contrastive Dual Gating: Learning Sparse Features With Contrastive Learning Efficient Maximal Coding Rate Reduction by Variational Forms Everything at Once - Multi-modal Fusion Transformer for Video Retrieval Towards Efficient and Scalable Sharpness-Aware Minimization X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval Merry Go Round: Rotate a Frame and Fool a DNN Label-Only Model Inversion Attacks via Boundary Repulsion Style-Structure Disentangled Features and Normalizing Flows for Diverse Icon Colorization How Much More Data Do I Need? 
Estimating Requirements For Downstream Tasks A sampling-based approach for efficient clustering in large datasets Deep Equilibrium Optical Flow Estimation Polarity Sampling: Quality and Diversity Control of Pre-Trained Generative Networks via Singular Values Multi-label Iterated Learning for Image Classification with Label Ambiguity Cross-modal Map Learning for Vision and Language Navigation Learning with Neighbor Consistency for Noisy Labels Measuring Compositional Consistency for Video Question Answering Failure Modes of Domain Generalization Algorithms AutoRF: Learning 3D Object Radiance Fields from Single View Observations A Unified Model for Line Projections in Catadioptric Cameras OrphicX: A Causality-Inspired Latent Variable Model for Interpreting Graph Neural Networks Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning Cluster-guided Image Synthesis with Unconditional Models Self-supervised object detection from audio-visual correspondence Clipped Hyperbolic Classifiers Are Super-Hyperbolic Classifiers Local Learning Matters: Rethinking Data Heterogeneity in Federated Learning Weakly-Supervised Generation and Grounding of Visual Descriptions with Conditional Generative Models How much does input data type impact final face model accuracy? 
Certified Patch Robustness via Smoothed Vision Transformers PubTables-1M: Towards comprehensive table extraction from unstructured documents Fine-tuning Image Transformers using Learnable Memory GuideFormer: Transformers for Image Guided Depth Completion Motion-Adjustable Neural Implicit Video Representation LiDARCap: Long-range Marker-less 3D Human Motion Capture with LiDAR Point Clouds Multi-modal Alignment using Representation Codebook NOC-REK: Novel Object Captioning with Retrieved Vocabulary from External Knowledge Investigating Top-$k$ White-Box and Transferable Black-box Attack GPU-Based Homotopy Continuation for Minimal Problems in Computer Vision On the Instability of Relative Pose Estimation and RANSAC’s Role Dual Task Learning by Leveraging Both Dense Correspondence and Mis-Correspondence for Robust Change Detection With Imperfect Matches M3L: Language-based Video Editing via Multi-Modal Multi-Level Transformers Dynamic Scene Graph Generation via Anticipatory Pre-training ScanQA: 3D Question Answering for Spatial Scene Understanding PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures Large Images as Long Documents: Hierarchical ViTs with Self-Supervised Pretraining in Gigapixel Image Pyramids Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection On Guiding Visual Attention with Language Specification OnePose: One-Shot Object Pose Estimation without CAD Models Thin-Plate Spline Motion Model for Image Animation PokeBNN: A Binary Pursuit of Lightweight Accuracy Semi-Supervised Few-shot Learning via Multi-Factor Clustering FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback CLIPstyler: Image Style Transfer with a Single Text Condition Ithaca365: Dataset and Driving Perception under Repeated and Challenging Weather Conditions Out-of-distribution Generalization with Causal Invariant Transformations Zero-Shot Text-Guided Object Generation with Dream Fields Noise 
Distribution Adaptive Self-Supervised Image Denoising using Tweedie Distribution and Score Matching TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization NICGSlowDown: Evaluating the Efficiency Robustness of Neural Image Caption Generation Models Deep Unlearning via Randomized Conditionally Independent Hessians Multi-Modal Dynamic Graph Transformer for Visual Grounding Propagation Regularizer for Semi-supervised Learning with Extremely Scarce Labeled Samples Discrete Wasserstein Distributional Matching for Quantization in Image Hashing Robust fine-tuning of zero-shot models Probabilistic Representations for Video Contrastive Learning Come-Closer-Diffuse-Faster: Accelerating Conditional Diffusion Models for Inverse Problems through Stochastic Contraction Fine-Grained Object Classification via Self-Supervised Pose Alignment One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones A Framework for Learning Ante-hoc Explainable Models via Concepts Retrieval Augmented Classification for Long Tail Visual Recognition Deep Spectral Methods: A Surprisingly Strong Baseline for Unsupervised Semantic Segmentation and Localization Learning Video Representations of Human Motion from Synthetic Data Exploiting Pseudo Labels in a Self-Supervised Learning Framework for Improved Monocular Depth Estimation Efficient Deep Embedded Subspace Clustering Local-Adaptive Face Recognition via Graph-based Meta-Clustering and Regularized Adaptation GenDR: A Generalized Differentiable Renderer Fingerprinting Deep Neural Networks Globally via Universal Adversarial Perturbations Learning Multiple Adverse Weather Removal via Two-stage Knowledge Learning and Multi-contrastive Regularization: Toward a Unified Model