Cascade Transformers for End-to-End Person Search
Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning
Long-Tailed Recognition via Weight Balancing
InfoGCN: Representation Learning for Human Skeleton-based Action Recognition
Interactive Geometry Editing of Neural Radiance Fields
MLSLT: Towards Multilingual Sign Language Translation
360MonoDepth: High-Resolution 360° Monocular Depth Estimation
Generating Diverse and Natural 3D Human Motions from Text
Masked-attention Mask Transformer for Universal Image Segmentation
Pointly-Supervised Instance Segmentation
A Closer Look at Few-shot Image Generation
Learning Local-Global Contextual Adaptation for Multi-Person Pose Estimation
Neural 3D Scene Reconstruction with the Manhattan-world Assumption
Masked Autoencoders Are Scalable Vision Learners
De-rendering 3D Objects in the Wild
Direct Voxel Grid Optimization: Super-fast Convergence for Radiance Fields Reconstruction
Finding Badly Drawn Bunnies
GradViT: Gradient Inversion of Vision Transformers
On the Importance of Asymmetry for Siamese Representation Learning
Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation
Self-Supervised Material and Texture Representation Learning for Remote Sensing Tasks
Rethinking Efficient Lane Detection via Curve Modeling
StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis
Learning Fair Classifiers with Partially Annotated Group Labels
Demystifying the Neural Tangent Kernel from a Practical Perspective: Can it be trusted for Neural Architecture Search without training?
Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis
A ConvNet for the 2020s
Consistent 3D Scene Stylization as Stylized NeRF via 2D-3D Mutual Learning
Weakly Supervised Semantic Segmentation by Pixel-to-Prototype Contrast
Connecting the Complementary-view Videos: Joint Camera Identification and Subject Association
Decoupled Knowledge Distillation
Maximum Spatial Perturbation Consistency for Unpaired Image-to-Image Translation
Compound Domain Generalization via Meta-Knowledge Encoding
Bilateral Video Magnification Filter
EDTER: Edge Detection with Transformer
Structure-Aware Motion Transfer with Deformable Anchor Model
Attentive Fine-Grained Structured Sparsity for Image Restoration
Sign Language Video Retrieval with Free-Form Textual Queries
SplitNets: Designing Neural Architectures for Efficient Distributed Computing on Head-Mounted Systems
Neural Mean Discrepancy for Efficient Out-of-Distribution Detection
LAKe-Net: Topology-Aware Point Cloud Completion by Localizing Aligned Keypoints
Focal and Global Knowledge Distillation for Detectors
Enhancing Adversarial Robustness for Deep Metric Learning
Novel Class Discovery in Semantic Segmentation
IDEA-Net: Dynamic 3D Point Cloud Interpolation via Deep Embedding Alignment
WarpingGAN: Warping Multiple Uniform Priors for Adversarial 3D Point Cloud Generation
Rethinking Reconstruction Autoencoder-Based Out-of-Distribution Detection
HyperDet3D: Learning a Scene-Conditioned 3D Object Detector
Deep Decomposition for Stochastic Normal-Abnormal Transport
Signing at Scale: Learning to Co-Articulate Signs for Large-Scale Photo-Realistic Sign Language Production
Self-supervised Video Transformers
HLRTF: Hierarchical Low-Rank Tensor Factorization for Inverse Problems in Multi-Dimensional Imaging
φ-SfT: Shape-from-Template with a Physics-based Deformation Model
Boosting View Synthesis with Residual Transfer
DINE: Domain Adaptation from Single and Multiple Black-box Predictors
Occluded Human Mesh Recovery
Understanding Uncertainty Maps in Vision with Statistical Testing
Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets
Learning from Pixel-Level Label Noise: A New Perspective for Light Field Salient Object Detection
Self-Supervised Global-Local Structure Modeling for Point Cloud Domain Adaptation with Reliable Voted Pseudo Labels
Towards An End-to-End Framework for Flow-Guided Video Inpainting
E-CIR: Event-Enhanced Continuous Intensity Recovery
Beyond Cross-view Image Retrieval: Highly Accurate Vehicle Localization using Satellite Image
Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers
Forward Propagation, Backward Regression and Pose Association for Hand Tracking in the Wild
FERV39k: A Large-Scale Multi-Scene Dataset for Facial Expression Recognition in Videos
Efficient Neural Radiance Fields
Robust Equivariant Imaging: a fully unsupervised framework for learning to image from noisy and partial measurements
HumanNeRF: Efficiently Generated Human Radiance Field from Sparse Inputs
Attributable Visual Similarity Learning
Efficient Multi-view Stereo by Iterative Dynamic Cost Volume
Replacing Labeled Real-image Datasets with Auto-generated Contours
SOMSI: Spherical Novel View Synthesis with Soft Occlusion Multi-Sphere Images
AutoSDF: Shape Priors for 3D Completion, Reconstruction, and Generation
MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions
PIE-Net: Photometric Invariant Edge Guided Network for Intrinsic Image Decomposition
DST: Dynamic Substitute Training for Data-free Black-box Attack
HCSC: Hierarchical Contrastive Selective Coding
Towards Diverse and Natural Scene-aware 3D Human Motion Synthesis
Inertia-Guided Flow Completion and Style Fusion for Video Inpainting
PlaneMVS: 3D Plane Reconstruction from Multi-View Stereo
Ref-NeRF: Structured View-Dependent Appearance for Neural Radiance Fields
Interactiveness Field of Human-Object Interactions
Learning Memory-Augmented Unidirectional Metrics for Cross-modality Person Re-identification
Event-based Video Reconstruction via Potential-assisted Spiking Neural Network
SIGMA: Semantic-complete Graph Matching for Domain Adaptive Object Detection
Surface Reconstruction from Point Clouds by Learning Predictive Context Priors
Active Teacher for Semi-Supervised Object Detection
Self-Sustaining Representation Expansion for Non-Exemplar Class-Incremental Learning
RCL: Recurrent Continuous Localization for Temporal Action Detection
GroupNet: Multiscale Hypergraph Neural Networks for Trajectory Prediction with Relational Reasoning
SPAMs: Structured Implicit Parametric Models
A Keypoint-based Global Association Network for Lane Detection
Weakly Supervised Semantic Segmentation using Out-of-Distribution Data
BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment
Investigating Tradeoffs in Real-World Video Super-Resolution
OakInk: A Large-scale Knowledge Repository for Understanding Hand-Object Interaction
Bending Graphs: Hierarchical Shape Matching using Gated Optimal Transport
The Norm Must Go On: Dynamic Unsupervised Domain Adaptation by Normalization
SimT: Handling Open-set Noise for Domain Adaptive Semantic Segmentation
Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation
Graph Sampling Based Deep Metric Learning for Generalizable Person Re-Identification
Stochastic Trajectory Prediction via Motion Indeterminacy Diffusion
Unbiased Subclass Regularization for Semi-Supervised Semantic Segmentation
Stratified Transformer for 3D Point Cloud Segmentation
Cloning Outfits from Real-World Images to 3D Characters for Generalizable Person Re-Identification
ImplicitAtlas: Learning Deformable Shape Templates in Medical Imaging
Sparse Instance Activation for Real-Time Instance Segmentation
Pastiche Master: Exemplar-Based High-Resolution Portrait Style Transfer
Unsupervised Image-to-Image Translation with Generative Prior
Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation
Versatile Multi-Modal Pre-Training for Human-Centric Perception
Instance-wise Occlusion and Depth Orders in Natural Scenes
Degradation-agnostic Correspondence from Resolution-asymmetric Stereo
No Pain, Big Gain: Classify Dynamic Point Cloud Sequences with Static Models by Fitting Feature-level Space-time Surfaces
Multi-Dimensional with Intensity: A Crowd-sourced Method for Measuring the Perception of Facial Expression
Class-Incremental Learning with Strong Pretrained Models
A Patch-centric Error Analysis of Image Super-Resolution
IFOR: Iterative Flow Minimization for Robotic Object Rearrangement
3D-aware Image Synthesis via Learning Structural and Textural Representations
DeeCap: Dynamic Early Exiting for Efficient Image Captioning
GAN-Supervised Dense Visual Alignment
Multilayer GAN Inversion and Editing
On Aliased Resizing and Surprising Subtleties in GAN Evaluation
Learning Pixel Trajectories with Multiscale Contrastive Random Walks
Comparing Correspondences: Video Prediction with Correspondences-wise Losses
Mix and Localize: Localizing Sound Sources from Mixtures
AziNorm: Exploiting the Radial Symmetry of Point Cloud for Azimuth-Normalized 3D Perception
Fourier PlenOctrees for Dynamic Radiance Field Rendering in Real-time
Point Cloud Pre-training with Natural 3D Structures
Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding
Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation
Mr.BiQ: Post-Training Non-Uniform Quantization based on Minimizing the Reconstruction Error
Drop the GAN: In Defense of Patches Nearest Neighbors as Single Image Generative Models
MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection
Reversible Vision Transformers
RigNeRF: Fully Controllable Neural 3D Portraits
Rethinking Depth Estimation for Multi-View Stereo: A Unified Representation
Integrative Few-Shot Learning for Classification and Segmentation
Learning Affordance Grounding from Exocentric Images
Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection
Exploring Geometry Consistency for monocular 3D object detection
Visual Abductive Reasoning
Putting People in their Place: Monocular Regression of 3D People in Depth
Exploiting Explainable Metrics for Augmented SGD
Rethinking Bayesian Deep Learning Methods for Semi-Supervised Volumetric Medical Image Segmentation
A Hybrid Quantum-Classical Algorithm for Robust Fitting
Dataset Distillation by Matching Training Trajectories
DiLiGenT10^2: A Photometric Stereo Benchmark Dataset with Controlled Shape and Material Variation
Scene Representation Transformer
ConDor: Self-Supervised Canonicalization of 3D Pose for Partial Shapes
Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion
Injecting Visual Concepts into End-to-End Image Captioning
Learning Neural Light Fields with Ray-Space Embedding Networks
What's in your hands? 3D Reconstruction of Generic Objects in Hands
Virtual Correspondences: Human as a Cue for Extreme-View Geometry
Unsupervised Activity Segmentation by Joint Representation Learning and Online Clustering
TransRank: Self-supervised Video Representation Learning via Ranking-based Transformation Recognition
SketchEdit: Mask-Free Local Image Manipulation with Partial Sketches
GroupViT: Zero-Shot Transfer to Semantic Segmentation with Text Supervision
LSVC: A Learning-based Stereo Video Compression Framework
BEHAVE: Dataset and Method for Tracking Human Object Interactions
Learning to Align Sequential Actions in the Wild
Motion-from-Blur: 3D Shape and Motion Estimation of Motion-blurred Objects in Videos
Fixing Malfunctional Objects With Learned Physical Simulation and Functional Prediction
Simulated Adversarial Testing of Face Recognition Models
GOAL: Generating 4D Whole-Body Motion for Hand-Object Grasping
Ensembling Off-the-shelf Models for GAN Training
Global Tracking Transformers
Visible-Thermal UAV Tracking: A Large-Scale Benchmark and New Baseline
Joint Global and Local Hierarchical Priors for Learned Image Compression
D-Grasp: Physically Plausible Dynamic Grasp Synthesis for Hand-Object Interactions
Human-Aware Object Placement for Visual Environment Reconstruction
Dual-path Image Inpainting with Auxiliary GAN Inversion
Accurate 3D Body Shape Regression using Metric and Semantic Attributes
BARC: Learning to Regress 3D Dog Shape from Images by Exploiting Breed Information
Capturing and Inferring Dense Full-Body Human-Scene Contact
Not All Labels Are Equal: Rationalizing The Labeling Costs for Training Object Detection
Background Activation Suppression for Weakly Supervised Object Localization
Attribute Group Editing for Reliable Few-shot Image Generation
Negative-aware Attention for Image-Text Matching
Watch It Move: Unsupervised Discovery of 3D Joints for Re-Posing of Articulated Objects
TransWeather: Transformer-based Restoration of Images Degraded by Adverse Weather Conditions
HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening
gDNA: Towards Generative Detailed Neural Avatars
CaDeX: Learning Canonical Deformation Coordinate Space for Dynamic Surface Representation via Neural Homeomorphism
BACON: Band-limited Coordinate Networks for Multiscale Scene Representation
Revisiting Near/Remote Sensing with Geospatial Attention
Simple multi-dataset detection
Generalizable Cross-modality Medical Image Segmentation via Style Augmentation and Dual Normalization
Point-to-Voxel Knowledge Distillation for LiDAR Semantic Segmentation
Online Convolutional Re-parameterization
Neural Inertial Localization
MNSRNet: Multimodal Transformer Network for 3D Surface Super-Resolution
Unsupervised Pre-training for Temporal Action Localization Tasks
Augmented Geometric Distillation for Data-Free Incremental Person ReID
HEAT: Holistic Edge Attention Transformer for Structured Reconstruction
NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition
ContrastMask: Contrastive Learning to Segment Every Thing
Unified Multivariate Gaussian Mixture for Efficient Neural Image Compression
CoordGAN: Self-Supervised Dense Correspondences Emerge from GANs
MAT: Mask-Aware Transformer for Large Hole Image Inpainting
A Comprehensive Study of End-to-End Temporal Action Detection
Rethinking Image Cropping: Exploring Diverse Compositions from Global Views
OcclusionFusion: Occlusion-aware Motion Estimation for Real-time Dynamic 3D Reconstruction
MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation
Asynchronous Event-based Graph-Neural Networks
RAMA: A Rapid Multicut Algorithm on GPU
EvUnroll: Neuromorphic Events based Rolling Shutter Image Correction
Cycle-Consistent Counterfactuals by Latent Transformations
Understanding 3D Object Articulation in Internet Videos
Synthetic Generation of Face Videos with Plethysmograph Physiology
MonoJSG: Joint Semantic and Geometric Cost Volume for Monocular 3D Object Detection
Neural Architecture Search with Representation Mutual Information
Weakly Supervised Temporal Sentence Grounding with Gaussian-based Contrastive Proposal Learning
Blind2Unblind: Self-Supervised Image Denoising with Visible Blind Spots
Semi-Supervised Object Detection via Multi-instance Alignment with Global Class Prototypes
Fine-Grained Predicates Learning for Scene Graph Generation
Meta Distribution Alignment for Generalizable Person Re-Identification
Align Representations with Base: A New Approach to Self-Supervised Learning
Style-Based Global Appearance Flow for Virtual Try-On
Learning Semantic Associations for Mirror Detection
Task Decoupled Framework for Reference-based Super-Resolution
Beyond Semantic to Instance Segmentation: Weakly-Supervised Instance Segmentation via Semantic Knowledge Transfer and Self-Refinement
Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction
GLAMR: Global Occlusion-Aware Human Mesh Recovery with Dynamic Cameras
Fast and Unsupervised Action Boundary Detection for Action Segmentation
Neural MoCon: Neural Motion Control for Physically Plausible Human Motion Capture
Unified Transformer Tracker for Object Tracking
NeuralHOFusion: Neural Volumetric Rendering under Human-object Interactions
H$^2$FA R-CNN: Holistic and Hierarchical Feature Alignment for Cross-domain Weakly Supervised Object Detection
ICON: Implicit Clothed humans Obtained from Normals
Semantic-Aware Domain Generalized Segmentation
ZebraPose: Coarse to Fine Surface Encoding for 6DoF Object Pose Estimation
Detecting Deepfakes with Self-Blended Images
Exact Feature Distribution Matching for Arbitrary Style Transfer and Domain Generalization
FreeSOLO: Learning to Segment Objects without Annotations
Auditing Privacy Defenses in Federated Learning via Generative Gradient Leakage
Differentially Private Federated Learning with Local Regularization and Sparsification
Modeling 3D Layout For Group Re-Identification
DASO: Distribution-Aware Semantics-Oriented Pseudo-label for Imbalanced Semi-Supervised Learning
Structured Local Radiance Fields for Human Avatar Modeling
Contrastive Regression for Domain Adaptation on Gaze Estimation
Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition
Joint Distribution Matters: Deep Brownian Distance Covariance for Few-Shot Classification
Tree Energy Loss: Towards Sparsely Annotated Semantic Segmentation
Learning Second Order Local Anomaly for General Face Forgery Detection
LGT-Net: Indoor Panoramic Room Layout Estimation with Geometry-Aware Transformer Network
Audio-Adaptive Activity Recognition Across Video Domains
Towards Robust and Adaptive Motion Forecasting: A Causal Representation Perspective
Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos
Omnivore: A Single Model for Many Visual Modalities
Multi-Frame Self-Supervised Depth with Transformers
Voice-Face Homogeneity Tells Deepfake
Representation Compensation Networks for Continual Semantic Segmentation
Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation
FLAVA: A Foundational Language And Vision Alignment Model
Vision Prompt Tuning
Vehicle trajectory prediction works, but not everywhere
Camera-Conditioned Stable Feature Generation for Isolated Camera Supervised Person Re-IDentification
ReSTR: Convolution-free Referring Image Segmentation Using Transformers
DATA: Domain-Aware and Task-Aware Self-supervised Learning
Sketching without Worrying: Noise-Tolerant Sketch-Based Image Retrieval
Balanced MSE for Imbalanced Visual Regression
The Devil Is in the Details: Window-based Attention for Image Compression
DeltaCNN: End-to-End CNN Inference of Sparse Frame Differences in Videos
CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding
Video Frame Interpolation Transformer
Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling
LASER: LAtent SpacE Rendering for 2D Visual Localization
LaTr: Layout-Aware Transformer for Scene-Text VQA
Universal Photometric Stereo Network using Global Lighting Contexts
Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training
Stochastic Backpropagation: A Memory Efficient Strategy for Training Video Models
Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory
Multi-View Consistent Generative Adversarial Networks for 3D-aware Image Synthesis
AdaViT: Adaptive Tokens for Efficient Vision Transformer
Neural Template: Topology-aware Reconstruction and Disentangled Generation of 3D Meshes
CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow
Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition
Cross-Modal Transferable Adversarial Attacks from Images to Videos
PTTR: Relational 3D Point Cloud Object Tracking with Transformer
Deformation and Correspondence Aware Unsupervised Synthetic-to-Real Scene Flow Estimation for Point Clouds
Lifelong Unsupervised Domain Adaptive Person Re-identification with Coordinated Anti-forgetting and Adaptation
Object Localization under Single Coarse Point Supervision
Weakly Supervised Temporal Action Localization via Representative Snippet Knowledge Propagation
TubeDETR: Spatio-Temporal Video Grounding with Transformers
Reinforced Structured State-Evolution for Vision-Language Navigation
Learning to Anticipate Future with Dynamic Context Removal
Learning Program Representations for Food Images and Cooking Recipes
Transferability Estimation using Bhattacharyya Class Separability
LiDAR Snowfall Simulation for Robust 3D Object Detection
Masked Feature Prediction for Vision Self-Supervised Pre-Training
Unbiased Teacher v2: Semi-supervised Object Detection for Anchor-free and Anchor-based Detectors
Shape from Polarization for Complex Scenes in the Wild
PhotoScene: Physically-Based Material and Lighting Transfer for Indoor Scenes
Node Representation Learning in Graph via Node-to-Neighbourhood Mutual Information Maximization
Selective-Supervised Contrastive Learning with Noisy Labels
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
L2G: A Simple Local-to-Global Knowledge Transfer Framework for Weakly Supervised Semantic Segmentation
TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing
Leveraging Self-Supervision for Cross-Domain Crowd Counting
Learning from Untrimmed Videos: Self-Supervised Video Representation Learning with Hierarchical Consistency
TimeReplayer: Unlocking the Potential of Event Cameras for Video Interpolation
Self-supervised Image-specific Prototype Exploration for Weakly Supervised Semantic Segmentation
Class-Balanced Pixel-Level Self-Labeling for Domain Adaptive Semantic Segmentation
Probabilistic Warp Consistency for Weakly-Supervised Semantic Correspondences
DIFNet: Boosting Visual Information Flow for Image Captioning
ScaleNet: A Shallow Architecture for Scale Estimation
HODOR: High-level Object Descriptors for Object Re-segmentation in Video Learned from Static Images
Density-preserving Deep Point Cloud Compression
Exploring Dual-task Correlation for Pose Guided Person Image Generation
Exploring Endogenous Shift for Cross-domain Detection: A Large-scale Benchmark and Perturbation Suppression Network
Transferability metrics for selecting Source Model Ensembles
The Auto Arborist Dataset: A Large-Scale Benchmark for Multimodal Urban Forest Monitoring Under Domain Shift
EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation
Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection
Learning from Temporal Gradient for Semi-supervised Action Recognition
JoinABLe: Learning Bottom-up Assembly of Parametric CAD Joints
DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion
Defensive Patches for Robust Recognition in the Physical World
UniCoRN: A Unified Conditional Image Repainting Network
APES: Articulated Part Extraction from Sprite Sheets
Learning Deep Implicit Functions for 3D Shapes with Dynamic Code Clouds
Neural Rays for Occlusion-aware Image-based Rendering
DisARM: Displacement Aware Relation Module for 3D Detection
A Stitch in Time Saves Nine: A Train-Time Regularizing Loss for Improved Neural Network Calibration
RIM-Net: Recursive Implicit Fields for Unsupervised Learning of Hierarchical Shape Structures
Weakly Supervised Object Localization as Domain Adaption
Reflash Dropout in Image Super-Resolution
Semantic Segmentation by Early Region Proxy
EyePAD++: A Distillation-based approach for joint Eye Authentication and Presentation Attack Detection using Periocular Images
Online Learning of Reusable Abstract Models for Object Goal Navigation
Time Microscope: Event-based Frame Interpolation with Parametric Non-linear Flow and Multi-scale Fusion
OSOP: A Multi-Stage One Shot Object Pose Estimation Framework
Localization Distillation for Dense Object Detection
RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs
Cross-Image Relational Knowledge Distillation for Semantic Segmentation
Trustworthy Long-tailed Classification
Episodic Memory Question Answering
REX: Reasoning-aware and Grounded Explanation
Query and Attention Augmentation for Knowledge-Based Explainable Reasoning
LOLNeRF: Learn from One Look
Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions
CoNeRF: Controllable Neural Radiance Fields
Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space
UnweaveNet: Unweaving Activity Stories
MeMOT: Multi-Object Tracking with Memory
VisualHow: Multimodal Problem Solving
Affine Medical Image Registration with Coarse-to-Fine Vision Transformer
Unpaired Deep Image Deraining Using Dual Contrastive Learning
DiRA: Discriminative, Restorative, and Adversarial Learning for Self-supervised Medical Image Analysis
Mask Transfiner for High-Quality Instance Segmentation
GLASS: Geometric Latent Augmentation for Shape Spaces
Global Convergence of MAML and Theory-Inspired Neural Architecture Search for Few-Shot Learning
Multi-modal Extreme Classification
CodedVTR: Codebook-Based Sparse Voxel Transformer in Geometric Regions
Frequency-driven Imperceptible Adversarial Attack on Semantic Similarity
Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization
Self-augmented Unpaired Image Dehazing via Density and Depth Decomposition
QueryDet: Cascaded Sparse Query for Accelerating High-Resolution Small Object Detection
Cross-modal Representation Learning for Zero-shot Action Recognition
Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation
AUV-Net: Learning Aligned UV Maps for Texture Transfer and Synthesis
Bijective Mapping Network for Shadow Removal
ObjectFormer for Image Manipulation Detection and Localization
GraFormer: Graph-oriented Transformer for 3D Pose Estimation
Multi-Granularity Alignment Domain Adaptation for Object Detection
Adaptive Hierarchical Representation Learning for Long-Tailed Object Detection
Physical Inertial Poser (PIP): Physics-aware Real-time Human Motion Tracking from Sparse Inertial Sensors
3D Scene Painting via Semantic Image Synthesis
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
One-bit Active Query with Contrastive Pairs
HOI4D: A 4D Egocentric Dataset for Category-Level Human-Object Interaction
Leveraging Object-Level Rotation Equivariance for 3D Object Detection
DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
JIFF: Jointly-aligned Implicit Face Function for High Fidelity Single View Clothed Human Reconstruction
Prompt Distribution Learning
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning
Beyond 3D Siamese Tracking: A Motion-Centric Paradigm for 3D Single Object Tracking in Point Clouds
Noisy Boundaries: Lemon or Lemonade for Semi-supervised Instance Segmentation?
Interactive Image Synthesis with Panoptic Layout Generation
Learning to Find Good Models in RANSAC
Meta-attention for ViT-backed Continual Learning
Deep Anomaly Discovery from Unlabeled Videos via Normality Advantage and Self-Paced Refinement
Improving neural implicit surfaces geometry with patch warping
Rope3D: Take A New Look from the 3D Roadside Perception Dataset for Autonomous Driving and Monocular 3D Object Detection Task
AME: Attention and Memory Enhancement in Hyper-Parameter Optimization
TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation
Automated Progressive Learning for Efficient Training of Vision Transformers
Templates for 3D Object Pose Estimation Revisited: Generalization to New Objects and Robustness to Occlusions
Towards Implicit Text-Guided 3D Shape Generation
Spatial-Temporal Parallel Transformer for Arm-Hand Dynamic Estimation
Revisiting skeleton-based action recognition
Mutual Quantization for Cross-Modal Search with Noisy Labels
Revisiting Temporal Alignment for Video Restoration
Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation
Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities
Video Frame Interpolation with Transformer
Autofocus for Event Cameras
Event-based Direct Sparse Odometry
OpenTAL: Towards Open Set Temporal Action Localization
Programmatic Concept Learning for Human Motion Description and Synthesis
MAXIM: Multi-Axis MLP for Image Processing
Temporal Alignment Networks for Long-term Video
Doodle It Yourself: Class Incremental Learning by Drawing a Few Sketches
Registering Explicit to Implicit: Towards High-Fidelity Garment mesh Reconstruction from Single Images
Progressive End-to-End Object Detection in Crowded Scenes
Object-aware Video-language Pre-training for Retrieval
Multi-Source Uncertainty Mining for Deep Unsupervised Saliency Detection
Surface Representation for Point Clouds
Context-Aware Video Reconstruction for Rolling Shutter Cameras
MonoScene: Monocular 3D Semantic Scene Completion
Weakly But Deeply Supervised Occlusion-Reasoned Parametric Road Layouts
Point Cloud Color Constancy
HDNet: High-resolution Dual-domain Learning for Spectral Compressive Imaging
iPLAN: Interactive and Procedural Layout Planning
End-to-End Multi-Person Pose Estimation with Transformers
Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation
Adversarial Eigen Attack on Black-Box Models
Domain-Aware Representation Learning for Unsupervised Domain Generalization
Sub-word Level Lip Reading With Visual Attention
Efficient Video Instance Segmentation via Tracklet Query and Proposal
Towards cross-modal pose localization from text-based position descriptions
Opening up Open World Tracking
Dynamic Clustering Mask Transformers for Panoptic Segmentation
Compressive Single-Photon 3D Cameras
Style-ERD: Responsive and Coherent Online Motion Style Transfer
MixFormer: Mixing Features across Windows and Dimensions
Robust Image Forgery Detection over Online Social Network Shared Images
Semantic-aligned Fusion Transformer for One-shot Object Detection
Long-term Video Frame Interpolation Via Feature Propagation
Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation
GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection
ETHSeg: An Amodel Instance Segmentation Network and a Real-world Dataset for X-Ray Waste Inspection
SEEG: Semantic Energized Co-speech Gesture Generation
Instance-Dependent Label-Noise Learning With Manifold-Regularized Transition Matrix Estimation
Acquiring a Dynamic Light Field through a Single-Shot Coded Image
How many Observations are Enough? Knowledge Distillation for Trajectory Forecasting
FaceVerse: a Fine-grained and Detail-changeable 3D Neural Face Model from a Hybrid Dataset
Learning Where to Learn in Cross-View Self-Supervised Learning
Automatic Relation-aware Graph Network Proliferation
CoSSL: Co-Learning of Representation and Classifier for Imbalanced Semi-Supervised Learning
P3Depth: Monocular Depth Estimation with a Piecewise Planarity Prior
Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability
En-Compactness: Self-Distillation Embedding & Contrastive Generation for Generalized Zero-Shot Learning
Unsupervised Learning of Accurate Siamese Tracking
Accelerating DETR Convergence via Semantic-Aligned Matching
Co-advise: Cross Inductive Bias Distillation
Medial Spectral Coordinates for 3D Shape Analysis
Coupled Iterative Refinement for 6D Multi-Object Pose Estimation
DeepCurrents: Learning Implicit Representations of Shapes with Boundaries
Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image
Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation
Day-to-Night Image Synthesis for Training Nighttime Neural ISPs
Playable Environments: Video Manipulation in Space and Time
Unified Contrastive Learning in Image-Text-Label Space
Many-to-many Splatting for Efficient Video Frame Interpolation
Uncertainty-Aware Deep Multi-View Photometric Stereo
Multi-Robot Active Mapping via Neural Bipartite Graph Matching
Location-free Human Pose Estimation
Multiview Transformers for Video Recognition
RIO: Rotation-equivariance supervised learning of robust inertial odometry
Few Shot Generative Model Adaption via Relaxed Spatial Structural Alignment
MiniViT: Compressing Vision Transformers with Weight Multiplexing
Pop-Out Motion: 3D-Aware Image Deformation via Learning Shape Laplacian
On the Road to Online Adaptation for Semantic Image Segmentation
Generalized Binary Search Network for Highly-Efficient Multi-View Stereo
Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation
MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens
Dynamic Kernel Selection for Improved Generalization and Memory Efficiency in Meta-learning
Regional Semantic Contrast and Aggregation for Weakly Supervised Semantic Segmentation
DLFormer: Discrete Latent Transformer for Video Inpainting
Continuous Scene Representations for Embodied AI
vCLIMB: A Novel Video Class Incremental Learning Benchmark
NODEO: A Neural Ordinary Differential Equation Based Optimization Framework for Deformable Image Registration
ONCE-3DLanes: Building Monocular 3D Lane Detection
ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer
HairMapper: Removing Hair from Portraits Using GANs
Dist-PU: Positive-Unlabeled Learning from a Label Distribution Perspective
Diversity Matters: Fully Exploiting Depth Clues for Reliable Monocular 3D Object Detection
Interactive Multi-Class Tiny-Object Detection
Generalizable Human Pose Triangulation
Towards Discriminative Representation: Multi-view Trajectory Contrastive Learning for Online Multi-object Tracking
A Simple Episodic Linear Probe Improves Visual Recognition in the Wild
Learning to Learn by Jointly Optimizing Neural Architecture and Weights
Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning
Learning Soft Estimator of Keypoint Scale and Orientation with Probabilistic Covariant Loss
Towards Semi-Supervised Deep Facial Expression Recognition with An Adaptive Confidence Margin
Cross Domain Object Detection by Target-Perceived Dual Branch Distillation
Depth-Aware Generative Adversarial Network for Talking Head Video Generation
OccAM's Laser: Occlusion-based Attribution Maps for 3D Object Detectors on LiDAR Data
Improving Adversarially Robust Few-shot Image Classification with Generalizable Representations
DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion
Stable Long-Term Recurrent Video Super-Resolution
Evading the Simplicity Bias: Training a Diverse Set of Models Discovers Solutions with Superior OOD Generalization
SelfD: Self-Learning Large-Scale Driving Policies From the Web
InstaFormer: Instance-Aware Image-to-Image Translation with Transformer
AutoGPart: Intermediate Supervision Search for Generalizable 3D Part Segmentation
GASP, a generalized framework for agglomerative clustering of signed graphs and its application to Instance Segmentation
Exploring and Evaluating Image Restoration Potential in Dynamic Scenes
Multi-level Feature Learning for Contrastive Multi-view Clustering
Commonality in Natural Images Rescues GANs: Pretraining GANs with Generic and Privacy-free Synthetic Data
Threshold Matters in WSSS: Manipulating the Activation for the Robust and Accurate Segmentation Model Against Thresholds
StyleSwin: Transformer-based GAN for High-resolution Image Generation
Semi-Supervised Learning of Semantic Correspondence with Pseudo-Labels
Divide and Conquer: Compositional Experts for Generalized Novel Class Discovery
Splicing ViT Features for Semantic Appearance Transfer
Optimizing Video Prediction via Video Frame Interpolation
Iterative Corresponding Geometry: Fusing Region and Depth for Highly Efficient 3D Tracking of Textureless Objects
HARA: A Hierarchical Approach for Robust Rotation Averaging
Revisiting Weakly Supervised Pre-Training of Visual Perception Models
Safe-Student for Safe Deep Semi-Supervised Learning with Unseen-Class Unlabeled Data
PatchFormer: An Efficient Point Transformer with Patch Attention
Locality-Aware Inter- and Intra-Video Reconstruction for Self-Supervised Correspondence Learning
Neural Global Shutter: Learn to Restore Video from a Rolling Shutter Camera with Global Reset Feature
Conditional Prompt Learning for Vision-Language Models
Stability-driven Contact Reconstruction From Monocular Color Images
SharpContour: A Contour-based Boundary Refinement Approach for Efficient and Accurate Instance Segmentation
MSDN: Mutually Semantic Distillation Network for Zero-Shot Learning
GeneralDepth: Unsupervised Learning of Single-Image Depth Estimation in General Scenes
Revisiting AP Loss for Dense Object Detection: Adaptive Ranking Pair Selection
No-Reference Point Cloud Quality Assessment via Domain Adaptation
DArch: Dental Arch Prior-assisted 3D Tooth Instance Segmentation with Weak Annotations
Self-Supervised Keypoint Discovery in Behavioral Videos
Toward Practical Self-Supervised Monocular Indoor Depth Estimation
Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices?
DPGEN: Differentially Private Generative Energy-Guided Network for Natural Image Synthesis
Learning the Degradation Distribution for Blind Image Super-Resolution
ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization
Exploiting Rigidity Constraints for LiDAR Scene Flow Estimation
Democracy Does Matter: Comprehensive Feature Mining for Co-Salient Object Detection
Unsupervised Domain Adaptation for Nighttime Aerial Tracking
UDA-COPE: Unsupervised Domain Adaptation for Category-level Object Pose Estimation
3D Shape Reconstruction from 2D Images with Disentangled Attribute Flow
Multimodal Dynamics: Dynamical Fusion for Trustworthy Multimodal Classification
Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer
StyTr²: Image Style Transfer with Transformers
BokehMe: When Neural Rendering Meets Classical Rendering
Memory-augmented Deep Conditional Unfolding Network for Pan-sharpening
Learning Object Context for Novel-view Scene Layout Generation
FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment
TCTrack: Temporal Contexts for Aerial Tracking
RBGNet: Ray-based Grouping for 3D Object Detection
3PSDF: Three-Pole Signed Distance Function for Learning Surfaces with Arbitrary Topologies
PanopticNeRF: A Semantic Object-Aware Neural Scene Representation
Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation
Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer
Reconstructing Surfaces for Sparse Point Clouds with On-Surface Priors
Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships
Details or Artifacts: A Locally Discriminative Learning Approach to Realistic Image Super-Resolution
Learning Motion-Dependent Appearance for High-Fidelity Rendering of Dynamic Humans from a Single Camera
A Voxel Graph CNN for Object Classification with Event Cameras
How Good Is Aesthetic Ability of a Fashion Model?
Recurrent Dynamic Embedding for Video Object Segmentation
Self-Distillation from the Last Mini-Batch for Consistency Regularization
Group Contextualization for Video Recognition
Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos
Dual Adversarial Adaptation for Cross-Device Real-World Image Super-Resolution
Urban Radiance Fields
Practical Evaluation of Adversarial Robustness via Adaptive Auto Attack
PINA: Learning a Personalized Implicit Neural Avatar from a Single RGB-D Video Sequence
Disentangled3D: Learning a 3D Generative Model with Disentangled Geometry and Appearance from Monocular Images
Global Sensing and Measurements Reuse for Image Compressed Sensing
AKB-48: A Real-World Articulated Object Knowledge Base
Structured Sparse R-CNN for Direct Scene Graph Generation
Photorealistic Monocular 3D Reconstruction of Humans Wearing Clothing
Spectral Unsupervised Domain Adaptation for Visual Recognition
SimMatch: Semi-supervised Learning with Similarity Matching
Multi-grained Spatio-Temporal Features Perceived Network for Event-based Lip-Reading
POCO: Point Convolution for Surface Reconstruction
HerosNet: Hyperspectral Explicable Reconstruction and Optimal Sampling Deep Network for Snapshot Compressive Imaging
Towards Robust Rain Removal Against Adversarial Attacks: A Comprehensive Benchmark Analysis and Beyond
FedDC: Federated Learning with Non-IID Data via Local Drift Decoupling and Correction
Open-set Text Recognition via Character-Context Decoupling
Generalized Few-shot Semantic Segmentation
Causal Transportability for Neural Representations
Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition
Matching Feature Sets for Few-Shot Image Classification
Interactron: Embodied Adaptive Object Detection
It’s About Time: Analog Clock Reading in the Wild
A Graph Matching Perspective with Transformers on Video Instance Segmentation
GIF: Neural Implicit Function for General Shape Representation
AdaViT: Adaptive Vision Transformers for Efficient Image Recognition
Language as Queries for Referring Video Object Segmentation
Federated Class-Incremental Learning
Human Hands as Probes for Interactive Object Understanding
STIF: Learning Continuous Video Representation for Space-Time Super-Resolution
Bridging Video-text Retrieval with Multiple Choice Questions
FoggyStereo: Stereo Matching with Fog Volume Representation
MonoGround: Detecting Monocular 3D Objects from the Ground
CLIMS: Cross Language Image Matching for Weakly Supervised Semantic Segmentation
ELIC: Efficient Learned Image Compression with Unevenly Grouped Space-Channel Contextual Adaptive Coding
Local Texture Estimator for Implicit Representation Function
Neural Recognition of Dashed Curves with Gestalt Law of Continuity
Voxel Field Fusion for 3D Object Detection
Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers
Both Style and Fog Matter: Cumulative Domain Adaptation for Semantic Foggy Scene Understanding
SCS-Co: Self-Consistent Style Contrastive Learning for Image Harmonization
H4D: Human 4D Modeling by Learning Neural Compositional Representation
PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer
A Unified Query-based Paradigm for Point Cloud Understanding
AdaInt: Learning Adaptive Intervals for 3D Lookup Tables on Real-time Image Enhancement
FS6D: Few-Shot 6D Pose Estimation of Novel Objects
CLIP-Event: Connecting Text and Images with Event Structures
Category Contrast for Unsupervised Domain Adaptation in Visual Tasks
GateHUB: Gated History Unit with Background Suppression for Online Action Detection
MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video
Learning 3D Object Shape and Layout without 3D Supervision
Discrete Cosine Transform Network for Guided Depth Super-Resolution
DTFD-MIL: Double-Tier Feature Distillation Multiple Instance Learning for Histopathology Whole Slide Image Classification
Recurrent Glimpse-based Decoder for Detection with Transformer
HSC4D: Human-centered 4D Scene Capture in Large-scale Indoor-outdoor Space Using Wearable IMUs and LiDAR
Multi-Object Tracking Meets Moving UAV
Estimating Fine-Grained Noise Model via Contrastive Learning
ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via Exploiting CLIP Cues
Task-specific Inconsistency Alignment for Domain Adaptive Object Detection
Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization
Global-Aware Registration of Less-Overlap RGB-D Scans
XMP-Font: Self-Supervised Cross-Modality Pre-training for Few-Shot Font Generation
A Simple Data Mixing Prior for Improving Self-Supervised Vision Transformer
Dense Learning based Semi-Supervised Object Detection
RNNPose: Recurrent 6-DoF Object Pose Refinement with Robust Correspondence Field Estimation and Pose Optimization
Global Context with Discrete Diffusion in Vector Quantised Modelling for Image Generation
Collaborative Learning for Hand and Object Reconstruction with Attention-guided Graph Convolution
End-to-end Generative Pretraining for Multimodal Video Captioning
Exposure Normalization and Compensation for Multiple Exposure Correction
Interpretable part-whole hierarchies and conceptual-semantic relationships in neural networks
Multi-label Classification with Partial Annotations using Class-aware Selective Loss
Fire Together Wire Together: A Dynamic Pruning Approach with Self-Supervised Mask Prediction
IterMVS: Iterative Probability Estimation for Efficient Multi-View Stereo
Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation
Hierarchical Nearest Neighbor Graph Embedding for Efficient Dimensionality Reduction
Decoupling Makes Weakly Supervised Local Feature Better
Not All Points Are Equal: Learning Highly Efficient Point-based Detectors for 3D LiDAR Point Clouds
Expanding Large Pre-trained Unimodal Models with Multimodal Information Injection for Image-Text Multimodal Classification
Semi-Weakly-Supervised Learning of Complex Actions from Instructional Videos
Set-Supervised Action Learning in Procedural Videos via Pairwise Order Consistency
SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation
BANMo: Building Animatable 3D Neural Models from Many Casual Videos
HD-CSE: Learning Dense Correspondence of Clothed Humans with Vision Transformers
Efficient Geometry-aware 3D Generative Adversarial Networks
CAPRI-Net: Learning Compact CAD Shapes with Adaptive Primitive Assembly
HL-Net: Heterophily Learning Network for Scene Graph Generation
Towards Efficient Data Free Black-box Adversarial Attack
Neural Collaborative Graph Machines for Table Structure Recognition
Dimension Embeddings for Monocular 3D Object Detection
Nested Collaborative Learning for Long-Tailed Visual Recognition
Scalable Penalized Regression for Noise Detection in Learning with Noisy Labels
Calibrating Deep Neural Networks by Pairwise Constraints
HybridCR: Weakly-Supervised 3D Point Cloud Semantic Segmentation via Hybrid Contrastive Regularization
Few-Shot Font Generation by Learning Fine-Grained Local Styles
Point-NeRF: Point-based Neural Radiance Fields
Spatial-Temporal Space Hand-in-Hand: Spatial-Temporal Video Super-Resolution via Cycle-Projected Mutual Learning
Learning from All Vehicles
Gait Recognition in the Wild with Dense 3D Representations and A Benchmark
DETReg: Unsupervised Pretraining with Region Priors for Object Detection
Rethinking Semantic Segmentation: A Prototype View
Distillation Using Oracle Queries for Transformer-based Human-Object Interaction Detection
MobRecon: Mobile-Friendly Hand Mesh Reconstruction from Monocular Image
Spatio-temporal Relation Modeling for Few-shot Action Recognition
RestoreFormer: High-Quality Blind Face Restoration from Undegraded Key-Value Pairs
DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis
Domain-Agnostic Prior for Unsupervised Transfer Segmentation
Unimodal-Concentrated Loss: Fully Adaptive Label Distribution Learning for Ordinal Regression
Pyramid Grafting Network for One-Stage High Resolution Saliency Detection
Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding
Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation
Towards Discovering the Effectiveness of Moderately Confident Samples for Semi-Supervised Learning
Semi-Supervised Video Semantic Segmentation with Inter-Frame Feature Reconstruction
Revisiting the "Video" in Video-Language Understanding
SNUG: Self-Supervised Neural Dynamic Garments
FocalClick: Towards Practical Interactive Image Segmentation
DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation
GRAM: Generative Radiance Manifolds for 3D-Aware Image Generation
Temporally Efficient Vision Transformer for Video Instance Segmentation
C-CAM: Causal CAM for Weakly Supervised Semantic Segmentation on Medical Image
Adversarial Texture for Fooling Person Detectors in the Physical World
Automatic Color Image Stitching Using Quaternion Rank-1 Alignment
TemporalUV: Capturing Loose Clothing with Temporally Coherent UV Coordinates
Kernelized Few-shot Object Detection by Integral Aggregation
Image-to-Lidar Self-Supervised Distillation for Autonomous Driving Data
Amodal Segmentation through Out-of-Task and Out-of-Distribution Generalization with a Bayesian Model
FocusCut: Diving into a Focus View in Interactive Segmentation
Mutual Information-driven Pan-sharpening
Gradient-SDF: A Semi-Implicit Surface Representation for 3D Reconstruction
Neural Head Avatars from Monocular RGB Videos
Point-Level Region Contrast for Object Detection Pre-Training
HODEC: Towards Efficient High-Order DEcomposed Convolutional Neural Networks
Bridging Global Context Interactions for High-Fidelity Image Completion
CDGNet: Class Distribution Guided Network for Human Parsing
Primitive3D: Learning from 3D Objects Assembled with Random Primitives
HumanNeRF: Free-viewpoint Rendering of Moving People from Monocular Video
TransMix: Attend to Mix for Vision Transformers
JRDB-Act: A Large-scale Dataset for Spatio-temporal Action, Social Group and Activity Detection
Few-shot Head Swapping in the Wild
Neural Texture Extraction and Distribution for Controllable Person Image Synthesis
Embracing Single Stride 3D Object Detector with Sparse Transformer
Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning
Portrait Eyeglasses and Shadow Removal by Leveraging 3D Synthetic Data
Expanding Low-Density Latent Regions for Open-Set Object Detection
GMFlow: Learning Optical Flow via Global Matching
Source-Free Domain Adaptation via Distribution Estimation
Aesthetic Text Logo Synthesis via Content-aware Layout Inferring
An Image Patch is a Wave: Phase-Aware Vision MLP
FisherMatch: Semi-Supervised Rotation Regression via Entropy-based Filtering
BE-STI: Spatial-Temporal Integrated Network for Class-agnostic Motion Prediction with Bidirectional Enhancement
DC-SSL: Addressing Mismatched Class Distribution in Semi-supervised Learning
Deterministic Point Cloud Registration via Novel Transformation Decomposition
Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos
Deep Visual Geo-localization Benchmark
LC-FDNet: Learned Lossless Image Compression with Frequency Decomposition Network
Towards Robust Vision Transformer
Volumetric Bundle Adjustment for Photorealistic Real-time Reconstruction
Continual Test-Time Domain Adaptation
Scribble-Supervised LiDAR Semantic Segmentation
TableFormer: Table Structure Understanding with Transformers
Focal Sparse Convolutional Networks for 3D Object Detection
CLRNet: Cross Layer Refinement Network for Lane Detection
Transformer Based Line Segment Classifier with Image Context for Real-Time Vanishing Point Detection in Manhattan World
NeRFReN: Neural Radiance Fields with Reflections
HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing
Ditto: Building Digital Twins of Articulated Objects from Interaction
CroMo: Cross-Modal Learning for Monocular Depth Estimation
Mobile-Former: Bridging MobileNet and Transformer
MetaFormer is Actually What You Need for Vision
RU-Net: Regularized Unrolling Network for Scene Graph Generation
Dreaming to Prune Image Deraining Networks
Salvage of Supervision in Weakly Supervised Object Detection
Lagrange Motion Analysis and View Embeddings for Improved Gait Recognition
Lite Pose: Efficient Architecture Design for 2D Human Pose Estimation
SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
FMCNet: Feature-Level Modality Compensation for Visible-Infrared Person Re-Identification
Generalizing Gaze Estimation with Rotation Consistency
SIOD: Single Instance Annotated Per Category Per Image for Object Detection
Temporal Complementarity-Guided Reinforcement Learning for Image-to-Video Person Re-Identification
A Differentiable Two-stage Alignment Scheme for Burst Image Reconstruction with Large Shift
Manifold Learning Benefits GANs
Domain Generalization via Shuffled Style Assembly for Face Anti-Spoofing
OW-DETR: Open-world Detection Transformer
Learning Optimal K-space Acquisition and Reconstruction using Physics-Informed Neural Networks
Global Tracking via Ensemble of Local Trackers
Robust Region Feature Synthesizer for Zero-Shot Object Detection
Confidence Propagation Cluster: Unleash Full Potential of Object Detectors
PartGlot: Learning Shape Part Segmentation from Language Reference Games
Self-Taught Metric Learning without Labels
GPV-Pose: Category-level Object Pose Estimation via Geometry-guided Point-wise Voting
OmniFusion: 360 Monocular Depth Estimation via Geometry-Aware Fusion
3D Common Corruptions and Data Augmentation
DIVeR: Real-time and Accurate Neural Radiance Fields with Deterministic Integration for Volume Rendering
Boosting Robustness of Image Matting with Context Assembling and Strong Data Augmentation
Cross-modal Clinical Graph Transformer For Ophthalmic Report Generation
Correlation-Aware Deep Tracking
Learning to Imagine: Diversify Memory for Incremental Learning using Unlabeled Data
Block-NeRF: Scalable Large Scene Neural View Synthesis
Vector Quantized Diffusion Model for Text-to-Image Synthesis
Boosting Crowd Counting via Multifaceted Attention
Physically-guided Disentangled Implicit Rendering for 3D Face Modeling
IFRNet: Intermediate Feature Refine Network for Efficient Frame Interpolation
TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers
Back to Reality: Weakly-supervised 3D Detection with Shape-guided Label Enhancement
Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding
Blind Image Super-resolution with Elaborate Degradation Modeling on Noise and Kernel
Reduce Information Loss in Transformers for Pluralistic Image Inpainting
OCSampler: Compressing Videos to One Clip with Single-step Sampling
Masking Adversarial Damage: Finding Adversarial Saliency for Robust and Sparse Network
SemAffiNet: Semantic-Affine Transformation for Point Cloud Segmentation
High-resolution Face Swapping via Latent Semantics Disentanglement
Deep Rectangling for Image Stitching: A Learning Baseline
Detector-Free Weakly Supervised Group Activity Recognition
Unsupervised Domain Generalization by learning a Bridge Across Domains
RSCFed: Random Sampling Consensus Federated Semi-supervised Learning
IntraQ: Learning Synthetic Images with Intra-Class Heterogeneity for Zero-Shot Network Quantization
A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution
Learned Queries for Efficient Local Attention
Look Back and Forth: Video Super-Resolution with Explicit Temporal Difference Modeling
HVH: Learning a Hybrid Neural Volumetric Representation for Dynamic Hair Performance Capture
Robust Contrastive Learning against Noisy Views
Discovering Objects that Can Move
TubeFormer-DeepLab: Video Mask Transformer
Sparse and Complete Latent Organization for Geospatial Semantic Segmentation
ITSA: An Information Theoretic Approach to Automatic Shortcut Avoidance and Domain Generalization in Stereo Matching Networks
Few-shot Backdoor Defense Using Shapley Estimation
Exploring Domain-Invariant Parameters for Source Free Domain Adaptation
Ev-TTA: Test-Time Adaptation for Event-Based Object Recognition
Likert Scoring with Grade Decoupling for Long-term Action Assessment
Unpaired Cartoon Image Synthesis via Gated Cycle Mapping
Contextual Instance Decoupling for Robust Multi-Person Pose Estimation
Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes
Modulated Contrast for Versatile Image Translation
Oriented RepPoints for Aerial Object Detection
INS-Conv: Incremental Sparse Convolution for Online 3D Segmentation
PanopticDepth: Instance-Decoupled Depth Estimation for Unified Depth-Aware Panoptic Segmentation
Point-BERT: Pre-Training 3D Point Cloud Transformers with Masked Point Modeling
Implicit Sample Extension for Unsupervised Person Re-Identification
Incorporating Semi-Supervised and Positive-Unlabeled learning for Boosting Full Reference Image Quality Assessment
HairCLIP: Design Your Hair by Text and Reference Image
C2AM Loss: Chasing a Better Decision Boundary for Long-Tail Object Detection
MogFace: Towards a Deeper Appreciation on Face Detection
RegionCLIP: Region-based Language-Image Pretraining
HP-Capsule: Unsupervised Face Part Discovery by Hierarchical Parsing Capsule Network
Structure-Aware Flow Generation for Human Body Reshaping
Revisiting Document Image Dewarping by Grid Regularization
GANSeg: Learning to Segment by Unsupervised Hierarchical Image Generation
Align and Prompt: Video-and-Language Pre-training with Entity Prompts
Bridging the Gap between Classification and Localization for Weakly Supervised Object Localization
Shunted Self-Attention via Multi-Scale Token Aggregation
VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention
MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer
YouMVOS: An Actor-centric Multi-shot Video Object Segmentation Dataset
Single-Stage is Enough: Multi-Person Absolute 3D Pose Estimation
UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection
DiSparse: Disentangled Sparsification for Multitask Model Compression
Coarse-to-fine Deep Video Coding with Hyperprior-guided Mode Prediction
Weakly Supervised High-Fidelity Clothing Model Generation
Deep Generalized Unfolding Networks for Image Restoration
Panoptic-PHNet: Towards Real-Time and High-Precision LiDAR Panoptic Segmentation via Clustering Pseudo Heatmap
ES6D: A Computation Efficient and Symmetry-Aware 6D Pose Regression Framework
Iterative Deep Homography Estimation
Homography Loss for Monocular 3D Object Detection
Infrared Invisible Clothing: Hiding from Infrared Detectors at Multiple Angles in Real World
Deep Stereo Image Compression via Bi-directional Coding
Degree-of-linear-polarization-based Color Constancy
Unleashing Potential of Unsupervised Pre-Training with Intra-Identity Regularization for Person Re-Identification
Aladdin: Joint Atlas Building and Diffeomorphic Registration Learning with Pairwise Alignment
Learning Transferable Human-Object Interaction Detector with Natural Language Supervision
PNP: Robust Learning from Noisy Labels by Probabilistic Noise Prediction
RayMVSNet: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo
Shapley-NAS: Discovering Operation Contribution for Neural Architecture Search
Few-shot Keypoint Detection with Uncertainty Learning for Unseen Species
Reusing the Task-specific Classifier as a Discriminator: Discriminator-free Adversarial Domain Adaptation
"The Pedestrian next to the Lamppost" Adaptive Object Graphs for Better Instantaneous Mapping
Point2Seq: Detecting 3D Objects as Sequences
Towards Noiseless Object Contours for Weakly Supervised Semantic Segmentation
Syntax-Aware Network for Handwritten Mathematical Expression Recognition
RAGO: Recurrent Graph Optimizer For Multiple Rotation Averaging
A Brand New Dance Partner: Music-Conditioned Pluralistic Dancing Controlled by Multiple Dance Genres
BNVF: Dense 3D Reconstruction using Bi-level Neural Volume Fusion
AutoLoss-Zero: Searching Loss Functions from Scratch for Generic Tasks
Exploring the Equivalence of Siamese Self-Supervised Learning via A Unified Gradient Framework
Cross-domain Few-shot Learning with Task-specific Adapters
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Geometric and Textural Augmentation for Domain Gap Reduction
Geometric Transformer for Fast and Robust Point Cloud Registration
Group R-CNN for Point-based Weakly Semi-supervised Object Detection
Wnet: Audio-Guided Video Semantic Segmentation via Wavelet-Based Cross-Modal Denoising Networks
3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds
ELSR: Efficient Line Segment Reconstruction with Planes and Points Guidance
A Proposal-based Paradigm for Self-supervised Sound Source Localization in Videos
Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer
End-to-End Referring Video Object Segmentation with Multimodal Transformers
Neural fields as learnable kernels for 3D reconstruction
IDR: Self-Supervised Image Denoising via Iterative Data Refinement
TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers
SimAN: Exploring Self-Supervised Representation Learning of Scene Text via Similarity-Aware Normalization
Deep vanishing point detection: Geometric priors make dataset variations vanish
On Adversarial Robustness of Trajectory Prediction for Autonomous Vehicles
Learning Multiple Dense Prediction Tasks from Partially Annotated Data
Quarantine: Sparsity Can Uncover the Trojan Attack Trigger for Free
Video Demoireing with Relation-based Temporal Consistency
FLAG: Flow-based 3D Avatar Generation from Sparse Observations
Learning an Optimal Linear Program for Multi-Target Tracking
IRON: Inverse Rendering by Optimizing Neural SDFs and Materials from Photometric Images
Stereoscopic Universal Perturbations across Different Architectures and Datasets
The Flag Median and FlagIRLS
NeRF in the Dark: High Dynamic Range View Synthesis from Noisy Raw Images
BoxeR: Box-Attention for 2D and 3D Transformers
DynamicEarthNet: Daily Multi-Spectral Satellite Dataset for Semantic Change Segmentation
UBnormal: New Benchmark for Supervised Open-Set Video Anomaly Detection
Self-Supervised Predictive Convolutional Attentive Block for Anomaly Detection
CADTransformer: Panoptic Symbol Spotting Transformer for CAD Drawings
The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy
Learning To Recognize Procedural Activities with Distant Supervision
Audio-driven Neural Gesture Reenactment with Video Motion Graphs
Towards Bidirectional Arbitrary Image Rescaling: Joint Optimization and Cycle Idempotence
Hire-MLP: Vision MLP via Hierarchical Rearrangement
Escaping Data Scarcity for High-Resolution Heterogeneous Face Hallucination
DeepDPM: Deep Clustering With an Unknown Number of Clusters
ZeroWaste Dataset: Towards Deformable Object Segmentation in Cluttered Scenes
Context-Aware Sequence Alignment using 4D Skeletal Augmentation
COAP: Compositional Articulated Occupancy of People
Sound and Visual Representation Learning with Multiple Pretraining Tasks
The Wanderings of Odysseus in 3D Scenes
Deblurring via Stochastic Refinement
SMPL-A: Modeling Person-Specific Deformable Anatomy
Neural Point Light Fields
FedCor: Correlation-Based Active Client Selection Strategy for Heterogeneous Federated Learning
ADeLA: Automatic Dense Labeling with Attention for Viewpoint Shift in Semantic Segmentation
Adversarial Parametric Pose Prior
Generating Useful Accident-Prone Driving Scenarios via a Learned Traffic Prior
Pre-Training meets Self-Training for Supersizing 3D Reconstruction
Safe Self-Refinement for Transformer-based Domain Adaptation
ElePose: Unsupervised 3D Human Pose Estimation by Predicting Camera Elevation and Learning Normalizing Flows on 2D Poses
Towards Multimodal Depth Estimation from Light Fields
Deformable Sprites for Unsupervised Video Decomposition
Can You Spot the Chameleon? Adversarially Camouflaging Images from Co-Salient Object Detection
MISF: Multi-level Interactive Siamese Filtering for High-Fidelity Image Inpainting
Aug-NeRF: Training Stronger Neural Radiance Fields with Triple-Level Physically-Grounded Augmentations
Semi-supervised Semantic Segmentation with Error Localization Network
Quantization-aware Deep Optics for Snapshot Hyperspectral Imaging
Gravitationally Lensed Black Hole Emission Tomography
Improving Video Model Transfer with Dynamic Representation Learning
FWD: Real-time Novel View Synthesis with Forward Warping and Depth
Enhancing Adversarial Training with Second-Order Statistics of Weights
Patch Slimming for Efficient Vision Transformers
3DAC: Learning Attribute Compression for Point Clouds
SNR-Aware Low-light Image Enhancement
Temporal Feature Alignment and Mutual Information Maximization for Video-Based Human Pose Estimation
Motion-modulated Temporal Fragment Alignment Network For Few-Shot Action Recognition
Self-Supervised Bulk Motion Artifact Removal in Optical Coherence Tomography Angiography
Salient-to-Broad Transition for Video Person Re-identification
Which images to label for few-shot medical landmark detection?
Hybrid Relation Guided Set Matching for Few-shot Action Recognition
Progressively Generating Better Initial Guesses Towards Next Stages for High-Quality Human Motion Prediction
Bringing Old Films Back to Life
Face Relighting with Geometrically Consistent Shadows
Learning Cloth-Irrelevant Features for Cloth-Changing Person Re-identification
DPICT: Deep Progressive Image Compression Using Trit-Planes
From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering
Simple but Effective: CLIP Embeddings for Embodied AI
Scene Consistency Representation Learning for Video Scene Segmentation
Neural Data-Dependent Transform for Learned Image Compression
CamLiFlow: Bidirectional Camera-LiDAR Fusion for Joint Optical Flow and Scene Flow Estimation
Global Matching with Overlapping Attention for Optical Flow Estimation
Meta Agent Teaming Active Learning for Pose Estimation
Robust Combination of Distributed Gradients Under Adversarial Perturbations
Toward Fast, Flexible, and Robust Low-Light Image Enhancement
Motion-aware Contrastive Video Representation Learning via Foreground-background Merging
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval
L-Verse: Bidirectional Generation Between Image and Text
GANORCON: Are Generative Models Useful for Few-shot Segmentation?
Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation
Towards Robust Adaptive Object Detection under Noisy Annotations
Point2Cyl: Reverse Engineering 3D Objects -- from Point Clouds to Extrusion Cylinders
MM-TTA: Multi-Modal Test-Time Adaptation for 3D Semantic Segmentation
Subspace Adversarial Training
Structural and Statistical Texture Knowledge Distillation for Semantic Segmentation
UniVIP: A Unified Framework for Self-Supervised Visual Pre-training
MUM: Mix Image Tiles and UnMix Feature Tiles for Semi-Supervised Object Detection
SS3D: Sparsely-Supervised 3D Object Detection from Point Cloud
On the Integration of Self-Attention and Convolution
Single-Domain Generalized Object Detection in Urban Scene via Cyclic-Disentangled Self-Distillation
Human Instance Matting via Mutual Guidance and Multi-Instance Refinement
Delving Deep into the Generalization of Vision Transformers under Distribution Shifts
Causality Inspired Representation Learning for Domain Generalization
Learning Local Displacements for Point Cloud Completion
Remember Intentions: Retrospective-Memory-based Trajectory Prediction
Contextual Similarity Distillation for Asymmetric Image Retrieval
Self-Supervised Models are Continual Learners
High-Fidelity Human Avatars from a Single RGB Camera
Not All Relations are Equal: Mining Informative Labels for Scene Graph Generation
TWIST: Two-Way Inter-label Self-Training for Semi-supervised 3D Instance Segmentation
Focal length and object pose estimation via render and compare
Kubric: A scalable dataset generator
VRDFormer: End-to-End Video Visual Relation Detection with Transformers
A Large-scale Comprehensive Dataset and Copy-overlap Aware Evaluation Protocol for Segment-level Video Copy Detection
Brain-inspired Multilayer Perceptron with Spiking Neurons
Few Could Be Better Than All: Feature Sampling and Grouping for Scene Text Detection
High Quality Segmentation for Ultra High-resolution Images
Physically Disentangled Intra- and Inter-domain Adaptation for Varicolored Haze Removal
HandOccNet: Occlusion-Robust 3D Hand Mesh Estimation Network
Future Transformer for Long-term Action Anticipation
Decoupling Zero-Shot Semantic Segmentation
Long-tail Recognition via Compositional Knowledge Transfer
Open Challenges in Deep Stereo: the Booster Dataset
BigDatasetGAN: Synthesizing ImageNet with Pixel-wise Annotations
Recall@k Surrogate Loss with Large Batches and Similarity Mixup
PoseTriplet: Co-evolving 3D Human Pose Estimation, Imitation, and Hallucination under Self-supervision
Dynamic Dual-Output Diffusion Models
End-to-End Human-Gaze-Target Detection with Transformers
EMOCA: Emotion Driven Monocular Face Capture and Animation
R(Det)$^2$: Randomized Decision Routing for Object Detection
Diffusion Autoencoders: Toward a Meaningful and Decodable Representation
PatchNet: A Simple Face Anti-Spoofing Framework via Fine-Grained Patch Recognition
NeurMiPs: Neural Mixture of Planar Experts for View Synthesis
Learning to generate line drawings that convey geometry and semantics
AlignQ: Alignment Quantization with ADMM-based Correlation Preservation
Learning Embodied Object-Search Strategies from 50k Human Demonstrations
Longitudinal Self-Supervision for Learning 2D Amodal Representation
Controllable Dynamic Multi-Task Architectures
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation
Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning
Depth-supervised NeRF: Fewer Views and Faster Training for Free
Learning to Detect Mobile Objects from LiDAR Scans Without Labels
Revisiting Random Channel Pruning for Neural Network Compression
ActiveZero: Mixed Domain Learning for Active Stereovision with Zero Annotation
Learning sRGB-to-Raw De-rendering with Content-Aware Metadata
SimVQA: Exploring Simulated Environments for Visual Question Answering
Cross-Domain Adaptive Teacher for Object Detection
Modality-Agnostic Learning for Radar-Lidar Fusion in Vehicle Detection
A Thousand Words Are Worth More Than a Picture: Natural Language-Centric Outside-Knowledge Visual Question Answering
Towards General Purpose Vision Systems: An End-to-End Task-Agnostic Vision-Language Architecture
Holocurtains: Programming Light Curtains via Binary Holography
Leverage Your Local and Global Representations: A New Self-Supervised Learning Strategy
3D human tongue reconstruction from single "in-the-wild" images
Pushing the Performance Limit of Scene Text Recognizer without Human Annotation
SAR-Net: Shape Alignment and Recovery Network for Category-level 6D Object Pose and Size Estimation
Improving Subgraph Recognition with Variational Graph Information Bottleneck
Towards Multi-domain Single Image Dehazing via Test-time Training
EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching
CHEX: CHannel EXploration for CNN Model Compression
ImFace: A Nonlinear 3D Morphable Face Model with Implicit Neural Representations
Deblur-NeRF: Neural Radiance Fields from Blurry images
An MIL-Derived Transformer for Weakly Supervised Point Cloud Segmentation
Distribution Consistent Neural Architecture Search
Training Object Detectors from Scratch: An Empirical Study in the Era of Vision Transformer
Glass Segmentation using Intensity and Spectral Polarization Cues
GAT-CADNet: Graph Attention Network for Panoptic Symbol Spotting in CAD Drawings
Unsupervised Deraining: Where Contrastive Learning Meets Self-similarity
Delving into the Estimation Shift of Batch Normalization in a Network
Depth Estimation by Combining Binocular Stereo and Monocular Structured-Light
Full-Range Virtual Try-On with Recurrent Tri-Level Transformation
Class Re-Activation Maps for Weakly-Supervised Semantic Segmentation
Generalizing Interactive Backpropagating Refinement for Dense Prediction Networks
Protecting Celebrities from DeepFake with Identity Consistency Transformer
SVIP: Sequence VerIfication for Procedures in Videos
Cannot See the Forest for the Trees: Aggregating Multiple Viewpoints to Better Classify Objects in Videos
Deep Saliency Prior for Reducing Visual Distraction
ClothFormer: Taming Video Virtual Try-on in All Module
FLARF: Fast LArge-scale Radiance Field Reconstruction
Estimating Structural Disparities in Face Models
Faithful Extreme Rescaling via Generative Prior Reciprocated Invertible Representations
Animal Kingdom: A Large and Diverse Dataset for Animal Behavior Understanding
Uniform Subdivision of Omnidirectional Camera Space for Efficient Spherical Stereo Matching
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
Scene Graph Expansion for Semantics-Guided Image Outpainting
Deep Constrained Least Squares for Blind Image Super-Resolution
MaskGIT: Masked Generative Image Transformer
CMT: Convolutional Neural Networks Meet Vision Transformers
GraftNet: Towards Domain Generalized Stereo Matching with a Broad-Spectrum and Task-Oriented Feature
SoftGroup for 3D Instance Segmentation on Point Clouds
Partial Class Activation Attention for Semantic Segmentation
AnyFace: Free-style Text-to-Face Synthesis and Manipulation
PoseKernelLifter: Metric Lifting of 3D Human Pose using Sound
LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection
Make It Move: Controllable Image-to-Video Generation with Text Descriptions
Semi-Supervised Semantic Segmentation Using Unreliable Pseudo-Labels
Learning What Not to Segment: A New Perspective on Few-Shot Segmentation
TT-VSR: Learning Trajectory-Aware Transformer for Video Super-Resolution
Canonical Voting: Towards Robust Oriented Bounding Box Detection in 3D Scenes
DyRep: Bootstrapping Training with Dynamic Re-parameterization
VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning
GreedyNASv2: Greedier Search with a Greedy Path Filter
HDR-NeRF: High Dynamic Range Neural Radiance Fields
Novel-View Object Selection in Neural Volumetric Representations
Relieving Long-tailed Instance Segmentation via Pairwise Class Balance
Complex Video Action Reasoning via Learnable Markov Logic Network
PCL: Proxy-based Contrastive Learning for Domain Generalization
Unifying Motion Deblurring and Frame Interpolation with Events
Shape-invariant 3D Adversarial Point Clouds
Learning Pixel-Level Distinctions for Video Highlight Detection
Wavelet Knowledge Distillation: Towards Efficient Image-to-Image Translation
ADAS: A Direct Adaptation Strategy for Multi-Target Domain Adaptive Semantic Segmentation
PSTR: End-to-End One-Step Person Search With Transformers
Towards real-world navigation with deep differentiable planners
Multi-class Token Transformer for Weakly Supervised Semantic Segmentation
Fourier Document Restoration for Robust Document Dewarping and Recognition
Neural RGB-D Surface Reconstruction
LMGP: Lifted Multicut Meets Geometry Projections for Multi-Camera Multi-Object Tracking
ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation
Spatio-Temporal Gating-Adjacency GCN for Human Motion Prediction
What Matters For Meta-Learning Vision Regression Tasks?
Self-supervised Learning of Adversarial Examples: Towards Good Generalizations for Deepfake Detection
Ray Priors through Reprojection: Improving Neural Radiance Fields for Novel View Extrapolation
Perception Prioritized Training of Diffusion Models
Pseudo-Stereo for Monocular 3D Object Detection in Autonomous Driving
Human Trajectory Prediction with Momentary Observation
General Facial Representation Learning in a Visual-Linguistic Manner
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model
Contextual Outpainting with Object-level Contrastive Learning
Optical Flow Estimation for Spiking Camera
PointCLIP: Point Cloud Understanding by CLIP
Large scale pre-training for person re-identification with noisy labels
Zoom In and Out: A Mixed-scale Triplet Network for Camouflaged Object Detection
Blended Diffusion for Text-driven Editing of Natural Images
CREAM: Weakly Supervised Object Localization via Class RE-Activation Mapping
Finding Fallen Objects Via Asynchronous Audio-Visual Integration
HeadNeRF: A Real-time NeRF-Based Parametric Head Model
Interacting Attention Graph for Single Image Two-Hand Reconstruction
Learning based Multi-modality Image and Video Compression
DR.VIC: Decomposition and Reasoning for Video Individual Counting
End-to-End Compressed Video Representation Learning for Generic Event Boundary Detection
BaLeNAS: Differentiable Architecture Search via Bayesian Learning Rule
Task Adaptive Parameter Sharing for Multi-Task Learning
ViM: Out-Of-Distribution with Virtual-logit Matching
Pyramid Adversarial Training Improves ViT Performance
Depth-Guided Sparse Structure-from-Motion for Movies and TV Shows
Part-based Pseudo Label Refinement for Unsupervised Person Re-identification
Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment
MVS2D: Efficient Multi-view Stereo via Attention-Driven 2D Convolutions
Consistent Explanations by Contrastive Learning
FvOR: Robust Joint Shape and Pose Optimization for Few-view Object Reconstruction
Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision
Frame Averaging for Equivariant Shape Space Learning
iFS-RCNN: An Incremental Few-shot Instance Segmenter
Bring Evanescent Representations to Life in Lifelong Class Incremental Learning
Text to Image Generation with Semantic-Spatial Aware GAN
Real-Time Light-Weight Near-Field Photometric Stereo
DESTR: Object Detection with Split Transformer
Backdoor Attacks on Self-Supervised Learning
Diverse Image Outpainting via GAN Inversion
High-Resolution Image Synthesis with Latent Diffusion Models
NFormer: Robust Person Re-identification with Neighbor Transformer
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
CrossLoc: Scalable Aerial Localization Assisted by Multimodal Synthetic Data
SceneSqueezer: Learning to Compress Scene for Camera Relocalization
Dancing under the stars: video denoising in starlight
Tracking People by Predicting 3D Appearance, Location and Pose
BCOT: A Markerless High-Precision 3D Object Tracking Benchmark
Continual Stereo Matching of Continuous Driving Scenes with Growing Architecture
CVF-SID: Cyclic multi-Variate Function for Self-Supervised Image Denoising by Disentangling Noise from Image
Unknown-Aware Object Detection: Learning What You Don’t Know from Videos in the Wild
BodyGAN: General-purpose Controllable Neural Human Body Generation
Training-free Transformer Architecture Search
Learning to Affiliate: Mutual Centralized Learning for Few-shot Classification
Single-Photon Structured Light
Towards Practical Certifiable Patch Defense with Vision Transformer
On Generalizing Beyond Domains in Cross-Domain Continual Learning
Practical Learned Lossless JPEG Recompression with Multi-Level Cross-Channel Entropy Model in the DCT Domain
GazeOnce: Real-Time Multi-Person Gaze Estimation
RendNet: Unified 2D/3D Recognizer with Latent Space Rendering
Identifying Ambiguous Similarity Conditions via Semantic Matching
Learn from Others and Be Yourself in Heterogeneous Federated Learning
Enhancing Face Recognition with Self-Supervised 3D Reconstruction
Visual Vibration Tomography: Estimating Interior Material Properties from Monocular Video
ACPL: Anti-curriculum Pseudo-labelling for Semi-supervised Medical Image Classification
The Two Dimensions of Worst-case Training and the Integrated Effect for Out-of-domain Generalization
Perturbed and Strict Mean Teachers for Semi-supervised Semantic Segmentation
Directional Self-supervised Learning for Heavy Image Augmentations
CPPF: Towards Robust Category-Level 9D Pose Estimation in the Wild
Cross-patch Dense Contrastive Learning for Semi-supervised Segmentation of Cellular Nuclei in Histopathologic Images
Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition
UCC: Uncertainty guided Cross-head Co-training for Semi-Supervised Semantic Segmentation
Few-Shot Object Detection with Fully Cross-Transformer
Exploiting Temporal Relations on Radar Perception for Autonomous Driving
Unsupervised Visual Representation Learning by Online Constrained K-Means
Contextual Debiasing for Visual Recognition with Causal Mechanisms
Learning to Estimate Robust 3D Human Mesh from In-the-Wild Crowded Scenes
Towards Accurate Facial Landmark Detection via Cascaded Transformers
DIP: Deep Inverse Patchmatch for High-Resolution Optical Flow
Critical Regularizations for Neural Surface Reconstruction in the Wild
Per-Clip Video Object Segmentation
CAFE: Learning to Condense Dataset by Aligning Features
ArtiBoost: Boosting Articulated 3D Hand-Object Pose Estimation via Online Exploration and Synthesis
SphereSR: 360° Image Super-Resolution with Arbitrary Projection via Continuous Spherical Image Representation
Learning to Restore 3D Face from In-the-Wild Degraded Images
BEVT: BERT Pretraining of Video Transformers
A Hybrid Egocentric Activity Anticipation Framework via Memory-Augmented Recurrent and One-shot Representation Forecasting
Sparse Fuse Dense: Towards High Quality 3D Detection with Depth Completion
MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection
Synthetic Aperture Imaging with Events and Frames
AP-BSN: Self-Supervised Denoising for Real-World Images via Asymmetric PD and Blind-Spot Network
Dynamic MLP for Fine-Grained Image Classification by Leveraging Geographical and Temporal Information
Lepard: Learning partial point cloud matching in rigid and deformable scenes
Neural Compression-Based Feature Learning for Video Restoration
Learning to Collaborate in Decentralized Learning of Personalized Models
Rethinking Parsing Branch for Human Densepose Estimation
Collaborative Transformers for Grounded Situation Recognition
ISNet: Shape Matters for Infrared Small Target Detection
Bi-level Doubly Variational Learning for Energy-based Latent Variable Models
PSMNet: Position-aware Stereo Merging Network for Room Layout Estimation
Bi-level Alignment for Cross-Domain Crowd Counting
Unsupervised Homography Estimation with Coplanarity-Aware GAN
Real-time Object Detection for Streaming Perception
Neural Window Fully-connected CRFs for Monocular Depth Estimation
Deep Hyperspectral-Depth Reconstruction Using Single Color-Dot Projection
Decoupled Multi-task Learning with Cyclical Self-Regulation for Face Parsing
Shadows can be Dangerous: Stealthy and Effective Physical-world Adversarial Attack by Natural Phenomenon
Towards Understanding Adversarial Robustness of Optical Flow Networks
Class-Incremental Learning by Knowledge Distillation with Adaptive Feature Consolidation
A Continuous Video Generator with the Price, Quality and Perks of StyleGAN2
Self-Supervised Learning of Object Parts for Semantic Segmentation
High-Resolution Image Harmonization via Collaborative Dual Transformations
Slot-VPS: Object-centric Representation Learning for Video Panoptic Segmentation
FIFO: Learning Fog-invariant Features for Foggy Scene Segmentation
Forecasting Characteristic 3D Poses of Human Actions
Equalized Focal Loss for Dense Long-tailed Object Detection
Style Neophile: Constantly Seeking Novel Styles for Domain Generalization
Mining Multi-View Information: A Strong Self-Supervised Framework for Depth-based 3D Hand Pose and Mesh Estimation
The Devil is in the Labels: Noisy Label Correction for Robust Scene Graph Generation
Correlation Verification for Image Retrieval
Exploring Denoised Cross-video Contrast for Weakly-supervised Temporal Action Localization
UBoCo: Unsupervised Boundary Contrastive Learning for Generic Event Boundary Detection
Multi-View Mesh Reconstruction with Neural Deferred Shading
SoftCollage: A Differentiable Probabilistic Tree Generator for Image Collage
OVE6D: Object Viewpoint Encoding For Depth-based 6D Object Pose Estimation
Smooth-Swap: A Simple Enhancement for Face-Swapping with Smoothness
3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection
Image Disentanglement Autoencoder for Steganography without Embedding
Gated2Gated: Self-Supervised Depth Estimation from Gated Images
Interact before Align: Leveraging Cross-Modal Knowledge for Domain Adaptive Action Recognition
DN-DETR: Accelerate DETR Training by Introducing Query DeNoising
The Probabilistic Normal Epipolar Constraint for Frame-To-Frame Rotation Optimization under Uncertain Feature Positions
A Scalable Combinatorial Solver for Elastic Geometrically Consistent 3D Shape Matching
Enhancing Classifier Conservativeness and Robustness by Polynomiality
Raw High-Definition Radar for Multi-Task Learning
Self-Supervised Image Representation Learning with Geometric Set Consistency
Multi-View Transformer for 3D Visual Grounding
Semiconductor Defect Detection by Hybrid Classical-Quantum Deep Learning
Attention Reveals Occlusions
Revisiting Domain Generalized Stereo Matching Networks from a Feature Consistency Perspective
Chi-transformer: Towards Reliable Stereo From Cues
NinjaDesc: Content-Concealing Visual Descriptors via Adversarial Learning
SwapMix: Diagnosing and Regularizing the Over-reliance on Visual Context in Visual Question Answering
Learning Part Segmentation through Unsupervised Domain Adaptation from Synthetic Vehicles
CellTypeGraph: A New Geometric Computer Vision Benchmark
Siamese Contrastive Embedding Network for Compositional Zero-Shot Learning
Reference-based Video Super-Resolution Using Multi-Camera Video Triplets
End-to-End Semi-Supervised Learning for Video Action Detection
Parameter-free Online Test-time Adaptation
3D Shape Variational Autoencoder Latent Disentanglement via Mini-Batch Feature Swapping for Bodies and Faces
Dual-Key Multimodal Backdoors for Visual Question Answering
Can Neural Nets Learn the Same Model Twice? Investigating Reproducibility and Double Descent from the Decision Boundary Perspective
RePaint: Inpainting using Denoising Diffusion Probabilistic Models
Improving GAN Equilibrium by Raising Spatial Awareness
Beyond Supervised vs. Unsupervised: Representative Benchmarking and Analysis of Image Representation Learning
A variational Bayesian method for similarity learning in non-rigid image registration
Task2Sim: Towards Effective Pre-training and Transfer from Synthetic Data
Adaptive Trajectory Prediction via Transferable GNN
Learning to Learn across Diverse Data Biases in Deep Face Recognition
RIDDLE: Lidar Data Compression with Range Image Deep Delta Encoding
Total Variation Optimization Layers for Computer Vision
Transforming Model Prediction for Tracking
Human Mesh Recovery from Multiple Shots
FastDOG: Fast Discrete Optimization on GPU
Estimating Example Difficulty using Variance of Gradients
Closing the Generalization Gap of Cross-silo Federated Medical Image Segmentation
Scale-Equivalent Distillation for Semi-Supervised Object Detection
Long-term Visual Map Sparsification with Heterogeneous GNN
ResSFL: A Resistance Transfer Framework for Defending Model Inversion Attack in Split Federated Learning
Fast Point Transformer
Sketch3T: Test-time Training for Zero-Shot SBIR
Generative Flows with Invertible Attentions
ABO: Dataset and Benchmarks for Real-World 3D Object Understanding
A Dual Weighting Label Assignment Scheme for Object Detection
ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts
Explore the Spatio-temporal Aggregation for Insubstantial Object Detection: Benchmark Dataset and Baseline
A Deeper Dive Into What Deep Spatiotemporal Networks Encode: Quantifying Static vs. Dynamic Information
DGECN: A Depth-Guided Edge Convolutional Network For End-to-End 6D Pose Estimation
BNUDC: A Two-Branched Deep Neural Network for Restoring Images from Under-Display Cameras
Towards Fewer Annotations: Active Learning via Region Impurity and Prediction Uncertainty for Domain Adaptive Semantic Segmentation
Hallucinated Neural Radiance Fields in the Wild
The Devil is in the Margin: Margin-based Label Smoothing for Network Calibration
Deep Depth from Focus with Differential Focus Volume
Towards Layer-wise Image Vectorization
Robust Federated Learning with Noisy and Heterogeneous Clients
Retrieval-based Spatially Adaptive Normalization for Semantic Image Synthesis
Dynamic Prototype Convolution Network for Few-Shot Semantic Segmentation
Video Shadow Detection via Spatio-Temporal Interpolation Consistency Training
It's All In the Teacher: Zero-Shot Quantization Brought Closer to the Teacher
VISOLO: Grid-Based Space-Time Aggregation for Efficient Online Video Instance Segmentation
Rethinking Spatial Invariance of Convolutional Networks for Object Counting
Self-supervised Correlation Mining Network for Person Image Generation
ISDNet: Integrating Shallow and Deep Networks for Efficient Ultra-high Resolution Segmentation
Exploring Effective Data for Surrogate Training Towards Black-box Attack
Contrastive Learning for Space-Time Correspondence via Self-cycle Consistency
Accelerating Video Object Segmentation with Compressed Video
Weakly Paired Associative Learning for Sound and Image Representations via Bimodal Associative Memory
Incremental Cross-view Mutual Distillation for Self-supervised Medical CT Synthesis
Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer
Non-parametric Depth Distribution Modelling based Depth Inference for Multi-view Stereo
LISA: Learning Implicit Shape and Appearance of Hands
GIQE: Generic Image Quality Enhancement via N$^{th}$ Order Iterative Degradation
Continual Learning for Visual Search with Backward Consistent Feature Embedding
STCrowd: A Multimodal Dataset for Pedestrian Perception in Crowded Scenes
Differentiable Stereopsis: Meshes from multiple views using differentiable rendering
ST++: Make Self-training Work Better for Semi-supervised Semantic Segmentation
Arbitrary-Scale Image Synthesis
CRIS: CLIP-Driven Referring Image Segmentation
ShapeFormer: Transformer-based Shape Completion via Sparse Representation
Quantifying Societal Bias Amplification in Image Captioning
Omni-DETR: Omni-Supervised Object Detection with Transformers
XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding
Cross-Architecture Self-supervised Video Representation Learning
Feature Erasing and Diffusion Network for Occluded Person Re-Identification
Styleformer: Transformer based Generative Adversarial Networks with Style Vector
A Re-Balancing Strategy for Class-Imbalanced Classification Based on Instance Difficulty
360-Attack: Distortion-Aware Perturbations from Perspective-Views
CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields
Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos
Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing
NICE-SLAM: Neural Implicit Scalable Encoding for SLAM
FIBA: Frequency-Injection based Backdoor Attack in Medical Image Analysis
Learning Modal-Invariant and Temporal-Memory for Video-based Visible-Infrared Person Re-Identification
Continual Predictive Learning from Videos
BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning
Learning to Zoom Inside Camera Imaging Pipeline
TeachAugment: Data Augmentation Optimization Using Teacher Knowledge
PhyIR: Physics-based Inverse Rendering for Panoramic Indoor Images
Finding Good Configurations of Planar Primitives in Unorganized Point Clouds
Towards Better Understanding Attribution Methods
B-cos Networks: Alignment is All We Need for Interpretability
TO-FLOW: Efficient Continuous Normalizing Flows with Temporal Optimization adjoint with Moving Speed
Learning Invisible Markers for Hidden Codes in Offline-to-online Photography
Learning Distinctive Margin toward Active Domain Adaptation
Adiabatic Quantum Computing for Multi Object Tracking
Learnable Lookup Table for Neural Network Quantization
Artistic Style Discovery With Independent Components
Occlusion-Aware Cost Constructor for Light Field Depth Estimation
Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning
Which Model to Transfer? Finding the Needle in the Growing Haystack
Using 3D Topological Connectivity for Ghost Particle Reduction in Flow Reconstruction
Neural Points: Point Cloud Representation with Neural Fields
C$^2$AM: Contrastive learning of Class-agnostic Activation Map for Weakly Supervised Object Localization and Semantic Segmentation
RCP: Recurrent Closest Point for Point Cloud
Label, Verify, Correct: A Simple Few-Shot Object Detection Method
Weakly-supervised Action Transition Learning for Stochastic Human Motion Prediction
Dual-Generator Face Reenactment
BoostMIS: Boosting Medical Image Semi-supervised Learning with Adaptive Pseudo Labeling and Informative Active Annotation
InfoNeRF: Ray Entropy Minimization for Few-Shot Neural Volume Rendering
Balanced Contrastive Learning for Long-Tailed Visual Recognition
The Devil is in the Pose: Ambiguity-free 3D Rotation-invariant Learning via Pose-aware Convolution
Partially Does It: Towards Scene-Level FG-SBIR with Partial Input
Source-Free Object Detection by Learning to Overlook Domain Style
Region-Aware Face Swapping
COOPERNAUT: End-to-End Driving with Cooperative Perception for Networked Vehicles
NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks
SkinningNet: Two-Stream Graph Convolutional Neural Network for Skinning Prediction of Synthetic Characters
Efficient Large-scale Localization by Global Instance Recognition
All-photon Polarimetric Time-of-Flight Imaging
Parametric Scattering Networks
MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering
Coarse-to-Fine Feature Mining for Video Semantic Segmentation
Practical Stereo Matching via Cascaded Recurrent Network with Adaptive Correlation
Robust Egocentric Photo-realistic Facial Expression Transfer for Virtual Reality
Rethinking Visual Geo-localization for Large-Scale Applications
Polymorphic-GAN: Generating Aligned Samples across Multiple Domains with Learned Morph Maps
Balanced and Hierarchical Relation Learning for One-shot Object Detection
High-Fidelity GAN Inversion for Image Attribute Editing
Killing Two Birds with One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC
I M Avatar: Implicit Morphable Head Avatars from Videos
Proactive Image Manipulation Detection
Text Spotting Transformers
Learning a Structured Latent Space for Unsupervised Point Cloud Completion
PCA-Based Knowledge Distillation Towards Lightweight and Content-Style Balanced Photorealistic Style Transfer Models
Grounding Answers for Visual Questions Asked by Visually Impaired People
Efficient Classification of Very Large Images with Tiny Objects
Leveraging Adversarial Examples to Quantify Membership Information Leakage
Towards Practical Deployment-Stage Backdoor Attack on Deep Neural Networks
When to Prune? A Policy towards Early Structural Pruning
Robust Optimization as Data Augmentation for Large-scale Graphs
Sylph: A Hypernetwork Framework for Incremental Few-shot Object Detection
Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis
Harmony: A Generic Unsupervised Approach for Disentangling Semantic Content from Parameterized Transformations
The Implicit Values of A Good Hand Shake: Handheld Multi-Frame Neural Depth Refinement
Noise2NoiseFlow: Realistic Camera Noise Modeling without Clean Images
MetaPose: Fast 3D Pose from Multiple Views without 3D Supervision
Virtual Elastic Objects
StyleSDF: High-Resolution 3D-Consistent Image and Geometry Generation
Rethinking Architecture Design for Tackling Data Heterogeneity in Federated Learning
Self-supervised Neural Articulated Shape and Appearance Models
A Self-Supervised Descriptor for Image Copy Detection
Rethinking Deep Face Restoration
Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes
Rethinking Controllable Variational Autoencoders
Convolutions for Spatial Interaction Modeling
Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization
AdaFace: Quality Adaptive Margin for Face Recognition
Towards End-to-End Unified Scene Text Detection and Layout Analysis
Active Learning by Feature Mixing
Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs
Towards Better Plasticity-Stability Trade-off in Incremental Learning: A Simple Linear Connector
Cloth-Changing Person Re-identification from A Single Image with Gait Prediction and Regularization
SpaceEdit: Learning a Unified Editing Space for Open-Domain Image Editing
Learning to Answer Questions in Dynamic Audio-Visual Scenarios
Non-generative Generalized Zero-shot Learning via Task-correlated Disentanglement and Controllable Samples Synthesis
Knowledge-Driven Self-Supervised Representation Learning for Facial Action Unit Recognition
Coupling Vision and Proprioception for Navigation of Legged Robots
URetinex-Net: Retinex-based Deep Unfolding Network for Low-light Image Enhancement
Modeling Image Composition for Complex Scene Generation
Think Twice Before Detecting GAN-generated Fake Images from their Spectral Domain Imprints
Undoing the Damage of Label Shift for Cross-domain Semantic Segmentation
Implicit Motion Handling for Video Camouflaged Object Detection
Contrastive Conditional Neural Processes
Exploring Set Similarity for Dense Self-supervised Representation Learning
E2V-SDE: From Asynchronous Events to Fast and Continuous Video Reconstruction via Neural Stochastic Differential Equations
Catching Both Gray and Black Swans: Open-set Supervised Anomaly Detection
M5Product: Self-harmonized Contrastive Learning for E-commercial Multi-modal Pretraining
CycleMix: A Holistic Strategy for Medical Image Segmentation from Scribble Supervision
Mixed Multimodal Tokens for Vision Transformers
Rethinking the Augmentation Module in Contrastive Learning: Learning Hierarchical Augmentation Invariance with Expanded Views
AirObject: A Temporally Evolving Graph Embedding for Object Identification
Balanced Multimodal Learning via On-the-fly Gradient Modulation
Ray3D: ray-based 3D human pose estimation for monocular absolute 3D localization
Computing Wasserstein-$p$ Distance Between Images with Linear Cost
Capturing Humans in Motion: Temporal-Attentive 3D Human Pose and Shape Estimation from Monocular Video
Feature Statistics Mixing Regularization for Generative Adversarial Networks
Expressive Talking Head Generation with Granular Audio-Visual Control
Geometric Anchor Correspondence Mining with Uncertainty Modelling for Universal Domain Adaptation
OSSO: Obtaining Skeletal Shape from Outside
How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs
GIRAFFE HD: A High-Resolution 3D-aware Generative Model
Continual Object Detection via Prototypical Task Correlation Guided Gating Mechanism
Pixel screening based intermediate correction for blind deblurring
LAS-AT: Adversarial Training with Learnable Attack Strategy
Eigenlanes: Data-Driven Lane Descriptors for Structurally Diverse Lanes
Moving Window Regression: A Novel Approach to Ordinal Regression
SC^2-PCR: A Second Order Spatial Compatibility for Efficient and Robust Point Cloud Registration
APRIL: Finding the Achilles' Heel on Privacy Leakage for Vision Transformers
Eigencontours: Novel Contour Descriptors Based on Low-Rank Approximation
Cross-modal Background Suppression for Audio-Visual Event Localization
WebQA: Multihop and Multimodal QA
Fairness-aware Adversarial Perturbation Towards Bias Mitigation for Deployed Deep Models
Distribution-Aware Single-Stage Models for Multi-Person 3D Pose Estimation
Active Learning for Open-set Annotation
E2EC: An End-to-End Contour-based Method for High-Quality High-Speed Instance Segmentation
Self-Supervised Arbitrary-Scale Point Clouds Upsampling via Implicit Neural Representation
Relative Pose from a Calibrated and an Uncalibrated Smartphone Image
Learning Optical Flow with Kernel Patch Attention
Contrastive Learning for Unsupervised Video Highlight Detection
ISNAS-DIP: Image-Specific Neural Architecture Search for Deep Image Prior
MVSE: A Large-Scale Benchmark Dataset for Multi-Modal Videos Similarity Evaluation
Discrete time convolution for fast event-based stereo
Proper Reuse of Image Classification Features Improves Object Detection
Object-Region Video Transformers
Vision-Language Pre-Training for Boosting Scene Text Detectors
Bandits for Structure Perturbation-based Black-box Attacks to Graph Neural Networks with Theoretical Guarantees
Revisiting Large Kernel Design in Convolutional Neural Networks
Generating High Fidelity Data from Low-density Regions using Diffusion Models
Colar: Effective and Efficient Online Action Detection by Consulting Exemplars
Learning Visual-Semantic Explanations of Deep Visual Latent Representations
StyleMesh: Style Transfer for Indoor 3D Scene Reconstructions
Probing Representation Forgetting in Supervised and Unsupervised Continual Learning
Light Field Neural Rendering
ROCA: Robust CAD Model Retrieval and Alignment from a Single Image
Pix2NeRF: Unsupervised Conditional pi-GAN for Single Image to Neural Radiance Fields Translation
Non-Iterative Recovery from Nonlinear Observations using Generative Models
Forecasting from LiDAR via Future Object Detection
Towards Total Recall in Industrial Anomaly Detection
Low-Resource Adaptation for Personalized Co-Speech Gesture Generation
Integrating Language Guidance into Vision-based Deep Metric Learning
Non-isotropy Regularization for Proxy-based Deep Metric Learning
Estimating Egocentric 3D Human Pose in the Wild with External Weak Supervision
Less is More: Generating Grounded Navigation Instructions from Landmarks
Automatic Synthesis of Diverse Weak Supervision Sources for Behavior Analysis
Performance-Aware Mutual Knowledge Distillation for Improving Neural Architecture Search
End-to-End Reconstruction-Classification Learning for Face Forgery Detection
UKPGAN: A General Self-Supervised Keypoint Detector
C2SLR: Consistency-enhanced Continuous Sign Language Recognition
Boosting Black-Box Attack with Partially Transferred Conditional Adversarial Distribution
Style Transformer for Image Inversion and Editing
Uformer: A General U-Shaped Transformer for Image Restoration
Speech Driven Tongue Animation
DO-GAN: A Double Oracle Framework for Generative Adversarial Networks
IntentVizor: Towards Generic Query Guided Interactive Video Summarization
Self-supervised Deep Image Restoration via Adaptive Stochastic Gradient Langevin Dynamics
Sound-Guided Semantic Image Manipulation
Adaptive Gating for Single-Photon 3D Imaging
Target-aware Dual Adversarial Learning and a Multi-scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection
GaTector: A Unified Framework for Gaze Object Prediction
Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation
Anomaly Detection via Reverse Distillation from One-Class Embedding
Dynamic 3D Gaze from Afar: Deep Gaze Estimation from Temporal Eye-Head-Body Coordination
Maximum Consensus by Weighted Influences of Monotone Boolean Functions
Beyond Fixation: Dynamic Window Visual Transformer
Dressing in the Wild by Watching Dance Videos
Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers
Contrastive Boundary Learning for Point Cloud Segmentation
Proto2Proto: Can you recognize the car, the way I do?
Bridged Transformer for Vision and Point Cloud 3D Object Detection
V2C: Visual Voice Cloning
An Efficient Training Approach for Very Large Scale Face Recognition
SemanticStyleGAN: Learning Compositional Generative Priors for Controllable Image Synthesis and Editing
SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition
Task Discrepancy Maximization for Fine-grained Few-Shot Classification
Reflection and Rotation Symmetry Detection via Equivariant Learning
Self-Supervised Equivariant Learning for Oriented Keypoint Detection
Improving the Transferability of Targeted Adversarial Examples through Object-Based Diverse Input
3DeformRS: Certifying Spatial Deformations on Point Clouds
DiGS: Divergence guided shape implicit neural representation for unoriented point clouds
UNICON: Combating Label Noise Through Uniform Selection and Contrastive Learning
Vision Transformer with Deformable Attention
Diverse Plausible 360-Degree Image Outpainting for Efficient 3DCG Background Creation
Industrial Style Transfer with Large-scale Geometric Warping and Content Preservation
Hierarchical Modular Network for Video Captioning
Optimal LED Spectral Multiplexing for NIR2RGB Translation
Exploring Frequency Adversarial Attacks for Face Forgery Detection
LAR-SR: A Local Autoregressive Model for Image Super Resolution
What do navigation agents learn about their environment?
HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation
Entropy-based Active Learning for Object Detection with Progressive Diversity Constraint
Class Similarity Weighted Knowledge Distillation for Continual Semantic Segmentation
Swin Transformer V2: Scaling Up Capacity and Resolution
Knowledge Distillation via the Target-aware Transformer
Sparse Object-level Supervision for Instance Segmentation with Pixel Embeddings
Open-Domain, Content-based, Multi-modal Fact-checking of Out-of-Context Images via Online Resources
Exemplar-based Pattern Synthesis with Implicit Periodic Field Network
RigidFlow: Self-Supervised Scene Flow Learning on Point Clouds by Local Rigidity Prior
Weakly Supervised Segmentation on Outdoor 4D point clouds with Temporal Matching and Spatial Graph Propagation
E^2(GO)MOTION: Motion Augmented Event Stream for Egocentric Action Recognition
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Spiking Transformers for Event-based Single Object Tracking
Few-Shot Incremental Learning for Label-to-Image Translation
CD^2-pFed: Cyclic Distillation-guided Channel Decoupling for Model Personalization in Federated Learning
OoD-Bench: Quantifying and Understanding Two Dimensions of Out-of-Distribution Generalization
Speed up Object Detection on Gigapixel-level Image with Patch Arrangement
Learning Adaptive Warping for Real-World Rolling Shutter Correction
Robust and Accurate Superquadric Recovery: a Probabilistic Approach
SimVP: Simpler yet Better Video Prediction
Hyperspherical Consistency Regularization
Dense Depth Priors for Neural Radiance Fields from Sparse Input Views
HyperInverter: Improving StyleGAN Inversion via Hypernetwork
Target-Relevant Knowledge Preservation for Multi-Source Domain Adaptive Object Detection
Whose Hands are These? Hand Detection and Hand-Body Association in the Wild
Blind Face Restoration via Integrating Face Shape and Generative Priors
Multimodal Material Segmentation
Do explanation methods explain? Model knows best
Deep Hybrid Models for Out-of-Distribution Detection
Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetics
Detecting Camouflaged Object in Frequency Domain
Exploring Structure-aware Transformer over Interaction Proposals for Human-Object Interaction Detection
Appearance and Structure Aware Robust Deep Visual Graph Matching: Attack, Defense and Beyond
PhoCaL: A Multi-Modal Dataset for Category-Level Object Pose Estimation with Photometrically Challenging Objects
HINT: Hierarchical Neuron Concept Explainer
Vox2Cortex: Fast Explicit Reconstruction of Cortical Surfaces from 3D MRI Scans with Geometric Deep Neural Networks
Generative Cooperative Learning for Unsupervised Video Anomaly Detection
Panoptic, Instance and Semantic Relations: A Relational Context Encoder to Enhance Panoptic Segmentation
Object-Relation Reasoning Graph for Action Recognition
Lifelong Graph Learning
A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation
Arch-Graph: Acyclic Architecture Relation Predictor for Task-Transferable Neural Architecture Search
Rethinking Minimal Sufficient Representation in Contrastive Learning
Physical Simulation Layer for Accurate 3D Modeling
Image Animation with Perturbed Masks
Sparse to Dense Dynamic 3D Facial Expression Generation
AIM: an Auto-Augmenter for Images and Meshes
PlanarRecon: Real-time 3D Plane Detection and Reconstruction from Posed Monocular Videos
Modular Action Concept Grounding in Semantic Video Prediction
Generating Representative Samples for Few-Shot Classification
SurfEmb: Dense and Continuous Correspondence Distributions for Object Pose Estimation with Learnt Surface Embeddings
Sequential Voting with Relational Box Fields for Active Object Detection
Are Multimodal Transformers Robust to Missing Modality?
Debiased Learning from Naturally Imbalanced Pseudo-Labels
Weakly-Supervised Online Action Segmentation in Multi-View Instructional Videos
Learning to deblur using light field generated and real defocus images
TOAD: Topologically-Aware Deformation Fields for Single-view 3D Reconstruction
An Empirical Study of Training End-to-End Vision-and-Language Transformers
PLAD: Learning to Infer Shape Programs with Pseudo-Labels and Approximate Distributions
The Neurally-Guided Shape Parser: Grammar-based Labeling of 3D Shape Regions with Approximate Inference
Imposing Consistency for Optical Flow Estimation
Generating Diverse 3D Reconstructions from a Single Occluded Face Image
RecDis-SNN: Rectifying Membrane Potential Distribution for Directly Training Spiking Neural Networks
3D Moments from Near-Duplicate Photos
CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation
MatteFormer: Transformer-Based Image Matting via Prior-Tokens
Deformable ProtoPNet: An Interpretable Image Classifier Using Deformable Prototypes
Learning Bayesian Sparse Networks with Full Experience Replay for Continual Learning
Category-Aware Transformer Network for Better Human-Object Interaction Detection
Segment, Magnify and Reiterate: Detecting Camouflaged Objects the Hard Way
UNIST: Unpaired Neural Implicit Shape-to-Shape Translation
REGTR: End-to-end Point Cloud Correspondences with Transformers
Show, Deconfound and Tell: Image Captioning with Causal Inference
DeepFake Disrupter: The Detector of DeepFake Is My Friend
Lite Vision Transformer with Enhanced Self-Attention
Bi-directional Object-context Prioritization Learning for Saliency Ranking
OSKDet: Orientation-sensitive Keypoint Localization for Rotated Object Detection
Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification
Invariant Grounding for Video Question Answering
Fine-tuning Global Model via Data-Free Knowledge Distillation for Non-IID Federated Learning
Learning Robust Image-Based Rendering on Sparse Scene Geometry via Depth Completion
FENeRF: Face Editing in Neural Radiance Fields
A Probabilistic Graphical Model Based on Neural-symbolic Reasoning for Visual Relationship Detection
CVNet: Contour Vibration Network for Building Extraction
What to Look at and Where: Semantic and Spatial Refined Transformer for Detecting Human-Object Interactions
Nested Hyperbolic Spaces for Dimensionality Reduction and Hyperbolic NN Design
ABPN: Adaptive Blend Pyramid Network for Real-Time Local Retouching of Ultra High-Resolution Photo
Does Robustness on ImageNet Transfer to Downstream Tasks?
Crowd Counting in the Frequency Domain
SimMIM: A Simple Framework for Masked Image Modeling
GrainSpace: A Large-scale Dataset for Fine-grained and Domain-adaptive Recognition of Cereal Grains
End-to-End Trajectory Distribution Prediction Based on Occupancy Grid Maps
MPViT: Multi-Path Vision Transformer for Dense Prediction
Remember the Difference: Cross-Domain Few-Shot Semantic Segmentation via Meta-Memory Transfer
ARCS: Accurate Rotation and Correspondence Search
Ranking Distance Calibration for Cross-Domain Few-Shot Learning
MetaFSCIL: A Meta-Learning Approach for Few-Shot Class Incremental Learning
Fisher Information Guidance for Learned Time-of-Flight Imaging
Joint Video Summarization and Moment Localization by Cross-Task Sample Transfer
MotionAug: Augmentation with Physical Correction for Human Motion Prediction
Deep Color Consistent Network for Low-Light Image Enhancement
Non-Probability Sampling Network for Stochastic Human Trajectory Prediction
GCFSR: a Generative and Controllable Face Super Resolution Method Without Facial and GAN Priors
Improving Adversarial Transferability via Neuron Attribution-Based Attacks
HiVT: Hierarchical Vector Transformer for Multi-Agent Motion Prediction
Pooling Revisited: Your Receptive Field is Sub-optimal
Compressing Models with Few Samples: Mimicking then Replacing
Shape from Thermal Radiation: Passive Ranging Using Multi-spectral LWIR Measurements
Layered Depth Refinement with Mask Guidance
Highly-efficient Incomplete Large-scale Multi-view Clustering with Consensus Bipartite Graph
Scaling Up Vision-Language Pretraining for Image Captioning
Optimal Correction Cost for Object Detection Evaluation
Deformable Video Transformer
High-fidelity Monocular Human Reconstruction by Combining Implicit and Explicit Representations
Nonlocal Sparse CRF
Long-Short Temporal Contrastive Learning of Video Transformers
QS-Attn: Query-Selected Attention for Contrastive Learning in I2I Translation
All-In-One Image Restoration for Unknown Corruption
Learning to Detect Scene Landmarks for Camera Localization
WildNet: Learning Domain Generalized Semantic Segmentation from the Wild
Pushing the Envelope of Gradient Boosting Forests via Globally-Optimized Oblique Trees
Egocentric Scene Understanding via Multimodal Spatial Rectifier
OSSGAN: Open-Set Semi-Supervised Image Generation
Large-scale Video Panoptic Segmentation in the Wild: A Benchmark
Unsupervised Representation Learning for Binary Networks by Joint Classifier Learning
β-DARTS: Beta-Decay Regularization for Differentiable Architecture Search
Stereo Depth from Events Cameras: Concentrate and Focus on the Future
Transferable Sparse Adversarial Attack
FAM: Visual Explanations for the Feature Representations from Deep Convolutional Networks
Noise-Aware NeRFs for Burst-Denoising
Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds
Bayesian Invariant Risk Minimization
Extracting Triangular 3D Models, Materials, and Lighting From Images
RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition
Transformer-empowered Multi-scale Contextual Matching and Aggregation for Multi-contrast MRI Super-resolution
SphericGAN: Semi-supervised Hyper-spherical Generative Adversarial Networks for Fine-grained Image Synthesis
LD-ConGR: A Large RGB-D Video Dataset for Long-Distance Continuous Gesture Recognition
Unifying Panoptic Segmentation for Autonomous Driving
VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning
Interspace Pruning: Using Adaptive Filter Representations to Improve Training of Sparse CNNs
NightLab: A Dual-level Architecture with Hardness Detection for Segmentation at Night
Learning to Memorize Feature Hallucination for One-Shot Image Generation
FedCorr: Multi-Stage Federated Learning for Label Noise Correction
GeoNeRF: Generalizing NeRF with Geometry Priors
Neural 3D Video Synthesis
TransforMatcher: Match-to-Match Attention for Semantic Correspondence
Represent, Compare, and Learn: A Similarity-Aware Framework for Class-Agnostic Counting
AxIoU: An Axiomatically Justified Measure for Video Moment Retrieval
Deep Safe Multi-view Clustering: Reducing the Risk of Clustering Performance Degradation Caused by View Increase
Burst Image Restoration and Enhancement
Modeling Indirect Illumination for Inverse Rendering
Knowledge Mining with Scene Text for Fine-Grained Recognition
FlexIT: Towards Flexible Semantic Image Translation
Surpassing the Human Accuracy: Detecting Gallbladder Cancer from USG Images with Curriculum Learning
More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech
Mimicking the Oracle: An Initial Phase Decorrelation Approach for Class Incremental Learning
Multi-Person Extreme Motion Prediction
Does text attract attention on e-commerce images: A novel saliency prediction dataset and method
Instance-Aware Dynamic Neural Network Quantization
Energy-based Latent Aligner for Incremental Learning
Semi-supervised Video Paragraph Grounding with Contrastive Encoder
Personalized Image Aesthetics Assessment with Rich Attributes
Attention Concatenation Volume for Accurate and Efficient Stereo Matching
Split Hierarchical Variational Compression
MS2DG-Net: Progressive Correspondence Learning via Multi Sparse Semantic Dynamic Graph
Large Loss Matters in Weakly Supervised Multi-Label Classification
Recurring the Transformer for Video Action Recognition
Look Closer to Supervise Better: One-Shot Font Generation via Component-Based Discriminator
KG-SP: Knowledge Guided Simple Primitives for Open World Compositional Zero-Shot Learning
Hyperbolic Vision Transformers: Combining Improvements in Metric Learning
Camera Pose Estimation using Implicit Distortion Models
A Structured Dictionary Perspective on Implicit Neural Representations
ST-MFNet: A Spatio-Temporal Multi-Flow Network for Frame Interpolation
Geometric Structure Preserving Warp for Natural Image Stitching
Slimmable Domain Adaptation
Meta Convolutional Neural Networks for Single Domain Generalization
Label Matching Semi-Supervised Object Detection
Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning
Abandoning the Bayer-Filter to See in the Dark
Deep Hierarchical Semantic Segmentation
MixFormer: End-to-End Tracking with Iterative Mixed Attention
ContIG: Self-supervised Multimodal Contrastive Learning for Medical Imaging with Genetics
Occlusion-robust Face Alignment using A Viewpoint-invariant Hierarchical Network Architecture
Segment-Fusion: Hierarchical Context Fusion for Robust 3D Semantic Segmentation
STRPM: A Spatiotemporal Residual Predictive Model for High-Resolution Video Prediction
Boosting 3D Object Detection by Simulating Multimodality on Point Clouds
RADU: Ray-Aligned Depth Update Convolutions for ToF Data Denoising
Auto-Encoder is All You Need
Whose Track Is It Anyway? Improving Robustness to Tracking Errors with Affinity-Based Prediction
Multi-marginal Contrastive Learning for Multi-label Subcellular Protein Localization
Stand-Alone Inter-Frame Attention in Video Models
Hyperbolic Image Segmentation
RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality
Time3D: End-to-End Joint Monocular 3D Object Detection and Tracking for Autonomous Driving
SWEM: Towards Real-Time Video Object Segmentation with Sequential Weighted Expectation-Maximization
ART-Point: Improving Rotation Robustness of Point Cloud Classifiers via Adversarial Rotation
Super-Fibonacci Spirals: Fast, Low-Discrepancy Sampling of SO(3)
Learning to Learn and Remember Super Long Multi-Domain Task Sequence
Noise Is Also Useful: Negative Correlation-Steered Latent Contrastive Learning
FLOAT: Factorized Learning of Object Attributes for Improved Multi-object Multi-part Scene Parsing
Surface-Aligned Neural Radiance Fields for Controllable 3D Human Synthesis
Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model
Real World Self-Supervised Multi-Image Super-Resolution for Multi-Exposure Push-Frame Satellites
Knowledge Distillation with the Reused Teacher Classifier
Geometry-Aware Guided Loss for Deep Crack Recognition
AdaMixer: A Simple and Accurate Query-based Object Detector
Learning Structured Gaussians to Approximate Deep Ensembles
Input-level Inductive Biases for 3D Reconstruction
BTS: A Bi-lingual Benchmark for Text Segmentation in the Wild
Stereo Magnification with Multi-Layer Images
Segment and Complete: Defending Object Detectors against Adversarial Patch Attacks with Robust Patch Detection
Coherent Point Drift Revisited for Non-rigid Shape Matching and Registration
Alleviating Semantics Distortion in Unsupervised Low-Level Image-to-Image Translation via Structure Consistency Constraint
CNN Filter DB: An Empirical Investigation of Trained Convolutional Filters
Text2Mesh: Text-Driven Neural Stylization for Meshes
RFNet: Unsupervised Network for Mutually Reinforcing Multi-modal Image Registration and Fusion
Image Dehazing Transformer with Transmission-Aware 3D Position Embedding
Label Relation Graphs Enhanced Hierarchical Residual Network for Hierarchical Multi-Granularity Classification
RGB-Multispectral Matching: Dataset, Learning Methodology, Evaluation
Maintaining Reasoning Consistency in Compositional Visual Question Answering
PolyWorld: Polygonal Building Extraction with Graph Neural Networks in Satellite Images
Fast Algorithm for Low-rank Tensor Completion in Delay-embedded Space
Dynamic Sparse R-CNN
Improving Robustness Against Stealthy Weight Bit-Flip Attacks by Output Code Matching
NPBG++: Accelerating Neural Point-Based Graphics
Forward Compatible Few-Shot Class-Incremental Learning
Weakly-supervised Metric Learning with Cross-Module Communications for the Classification of Anterior Chamber Angle Images
Learning Canonical F-Correlation Projection for Compact Multiview Representation
Learning Non-target Knowledge for Few-shot Semantic Segmentation
Towards Low-Cost and Efficient Malaria Detection
PoseTrack21: A Dataset for Person Search, Multi-Object Tracking and Multi-Person Pose Tracking
NeuralHDHair: Automatic High-fidelity Hair Modeling from a Single Image Using Implicit Neural Representations
ClusterGNN: Cluster-based Coarse-to-fine Graph Neural Network for Efficient Feature Matching
An Iterative Quantum Approach for Transformation Estimation from Point Sets
ATPFL: Automatic Trajectory Prediction Model Design under Federated Learning Framework
Understanding and Increasing Efficiency of Frank-Wolfe Adversarial Training
Targeted Supervised Contrastive Learning for Long-Tailed Recognition
Optimizing Elimination Templates by Greedy Parameter Search
M3T: three-dimensional Medical image classifier using Multi-plane and Multi-slice Transformer
Projective Manifold Gradient Layer for Deep Rotation Regression
PUMP: Pyramidal and Uniqueness Matching Priors for Unsupervised Learning of Local Descriptors
Deep orientation-aware functional maps: Tackling symmetry issues in Shape Matching
A Versatile Multi-View Framework for LiDAR-based 3D Object Detection with Guidance from Panoptic Segmentation
Lite-MDETR: A Lightweight Multi-Modal Detector
Cross Modal Retrieval with Querybank Normalisation
On Learning Contrastive Representations for Learning with Noisy Labels
Cross-view transformers for real-time map-view semantic segmentation
Towards Data-Free Model Stealing in a Hard Label Setting
The DEVIL is in the Details: A Diagnostic Evaluation Benchmark for Video Inpainting
Unseen Classes at a Later Time? No Problem
Channel Balancing for Accurate Quantization of Winograd Convolutions
Instance masks are what you need: Segmentation parity from object boundaries
TVConv: Efficient Translation Variant Convolution for Layout-aware Visual Processing
Scanline Homographies for Rolling-Shutter Plane Absolute Pose
Dual-Shutter Optical Vibration Sensing
DoubleField: Bridging the Neural Surface and Radiance Fields for High-fidelity Human Reconstruction and Rendering
Robust Structured Declarative Classifiers for 3D Point Clouds: Defending Adversarial Attacks with Implicit Gradients
TubeR: Tubelet Transformer for Video Action Detection
Data-Free Network Compression via Parametric Non-uniform Mixed Precision Quantization
Contour-Hugging Heatmaps for Landmark Detection
Local Attention Pyramid for Scene Image Generation
Implicit Feature Decoupling with Depthwise Quantization
InsetGAN for Full-Body Image Generation
Recurrent Variational Network: A Deep Learning Inverse Problem Solver applied to the task of Accelerated MRI Reconstruction
Robust Invertible Image Steganography
Disentangling visual and written concepts in CLIP
Causal CLIP Fine-tuning for Fashion Product Retrieval
Accelerating Neural Network Optimization Through an Automated Control Theory Lens
Comprehending and Ordering Semantics for Image Captioning
Grounded Language-Image Pre-training
Hierarchical Self-supervised Representation Learning for Movie Understanding
RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution
DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition
Consistency driven Sequential Transformers Attention Model for Partially Observable Scenes
Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better than Dot-Product Self-Attention
How Well Do Sparse ImageNet Models Transfer?
Towards Principled Disentanglement for Domain Generalization
Task-Adaptive Negative Class Envision for Few-Shot Open-Set Recognition
Path-CNN: Topology-Aware Centerline Segmentation Using Sparse Annotation
Image Based Reconstruction of Liquids from 2D Surface Detections
Neural Convolutional Surfaces
Graph-context Attention Networks for Size-varied Deep Graph Matching
Learning to Solve Hard Minimal Problems
Neural Mesh Simplification
SPAct: Self-supervised Privacy Preservation for Action Recognition
Towards Language-free Training for Text-to-Image Generation
Rep-Net: Efficient On-Device Learning via Feature Reprogramming
3D-VField: Learning to Adversarially Deform Point Clouds for Robust 3D Object Detection
TrackFormer: Multi-Object Tracking with Transformers
Deep 3D-to-2D Watermarking: Embedding Messages in 3D Meshes and Extracting Them from 2D Renderings
A Comprehensive Study of Image Classification Model Sensitivity to Foregrounds, Backgrounds, and Visual Attributes
EnvEdit: Environment Editing for Vision-and-Language Navigation
DeepFace-EMD: Re-ranking using Patch-wise Earth Mover's Distance Improves Out-of-Distribution Face Identification
Mega-NeRF: Scalable Construction of Large-Scale NeRFs for Virtual Fly-Throughs
MulT: An End-to-End Multitask Learning Transformer
Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields
Leveraging Real Talking Faces via Self-Supervision for Robust Forgery Detection
Use All The Labels: A Hierarchical Multi-Label Contrastive Learning Framework
Plenoxels: Radiance Fields without Neural Networks
Pushing the Limits of Simple Pipelines for Practical Few-Shot Learning
PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning
CO-SNE: Dimensionality Reduction and Visualization for Hyperbolic Data
EASE: Unsupervised Discriminant Subspace Learning for Transductive Few-Shot Learning
3D Photo Stylization: Learning to Generate Stylized Novel Views from a Single Image
SIMBAR: Single Image-Based Scene Relighting For Effective Data Augmentation For Automated Driving Vision Tasks
VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks
VALHALLA: Visual Hallucination for Machine Translation
Learning Pairwise Affinity for Open-World Instance Segmentation
CAD: Co-Adapting Discriminative Features for Improved Few-Shot Classification
Investigating the Impact of Multi-LiDAR Placement on Object Detection for Autonomous Driving
Hypergraph-Induced Semantic Tuplet Loss for Deep Metric Learning
Generalized Category Discovery
Deep Image-based Illumination Harmonization
Mixed Differential Privacy in Computer Vision
MUSE-VAE: Multi-Scale VAE for Environment-Aware Long Term Trajectory Prediction
UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog
Weakly Supervised Rotation-Invariant Aerial Object Detection Network
Evaluation-oriented Knowledge Distillation for Deep Face Recognition
Robust Cross-Modal Representation Learning with Progressive Self-Distillation
Transformer Tracking with Cyclic Shifting Window Attention
LTP: Lane-based Trajectory Prediction for Autonomous Driving
Generating 3D Bio-Printable Patches Using Wound Segmentation and Reconstruction to Treat Diabetic Foot Ulcers
Multi-instance Point Cloud Registration by Efficient Correspondence Clustering
AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition
AutoLoss-GMS: Searching Generalized Margin-based Softmax Loss Function for Person Re-identification
Convolution of Convolution: Let Kernels Spatially Collaborate
DiffPoseNet: Direct Differentiable Camera Pose Estimation
Modeling sRGB Camera Noise with Normalizing Flows
Semantic-shape Adaptive Feature Modulation for Semantic Image Synthesis
Federated Learning with Position-Aware Neurons
Symmetry and Uncertainty-Aware Object SLAM for 6DoF Object Pose Estimation
Point Density-Aware Voxels for LiDAR 3D Object Detection
A Conservative Approach for Unbiased Learning on Unknown Biases
The Majority Can Help the Minority: Context-rich Minority Oversampling for Long-tailed Classification
Symmetry-aware Neural Architecture for Embodied Visual Exploration
DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers
Egocentric Prediction of Action Target in 3D
What makes transfer learning work for medical images: feature reuse & other factors
Alignment-Uniformity aware Representation Learning for Zero-shot Video Classification
Unsupervised Learning of De-biased Representation with Pseudo-bias Attribute
DECORE: Deep Compression with Reinforcement Learning
RGB-Depth Fusion GAN for Indoor Depth Completion
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
Class-Aware Contrastive Semi-Supervised Learning
Learning to Prompt for Continual Learning
DEFEAT: Deep Hidden Feature Backdoor Attacks by Imperceptible Perturbation and Latent Representation Constraints
Self-Supervised Dense Consistency Regularization for Image-to-Image Translation
Forward Compatible Training for Large-Scale Embedding Retrieval Systems
Joint Forecasting of Panoptic Segmentations with Difference Attention
Revisiting the Transferability of Supervised Pretraining: an MLP Perspective
Disentangling Visual Embeddings for Attributes and Objects
SeeThroughNet: Resurrection of Auxiliary Loss by Preserving Class Probability Information
Neural Reflectance for Shape Recovery with Shadow Handling
Topology-Preserving Shape Reconstruction and Registration via Neural Diffeomorphic Flow
XYDeblur: Divide and Conquer for Single Image Deblurring
ScePT: Scene-consistent, Policy-based Trajectory Predictions for Planning
Visual Acoustic Matching
Fair Contrastive Learning for Facial Attribute Classification
Neural Prior for Trajectory Estimation
AutoMine: An Unmanned Mine Dataset
SMARTADAPT: Multi-branch Object Detection Framework for Videos on Mobiles
Neural Face Identification in a 2D Wireframe Projection of a Manifold Object
AlignMixup: Improving Representations By Interpolating Aligned Features
Memory-Augmented Non-Local Attention for Video Super-Resolution
ESCNet: Gaze Target Detection with the Understanding of 3D Scenes
AdaptPose: Cross-Dataset Adaptation for 3D Human Pose Estimation by Learnable Motion Generation
Distinguishing Unseen from Seen for Generalized Zero-shot Learning
When Does Contrastive Visual Representation Learning Work?
Privacy-preserving Online AutoML for Domain-Specific Face Detection
Robust outlier detection by de-biasing VAE likelihoods
GridShift: A Faster Mode-seeking Algorithm for Image Segmentation and Object Tracking
Continual Learning with Lifelong Vision Transformer
M2I: From Factored Marginal Trajectory Prediction to Interactive Prediction
Stochastic Variance Reduced Ensemble Adversarial Attack for Boosting the Adversarial Transferability
Representing 3D Shapes with Probabilistic Directed Distance Fields
Restormer: Efficient Transformer for High-Resolution Image Restoration
Learning with Twin Noisy Labels for Visible-Infrared Person Re-Identification
Few-shot Learning with Noisy Labels
Co-Domain Symmetry for Complex-Valued Deep Learning
Pyramid Architecture for Multi-Scale Processing in Point Cloud Segmentation
GCR: Gradient Coreset based Replay Buffer Selection for Continual Learning
Domain Adaptation on Point Clouds via Geometry-Aware Implicits
Ranking-Based Siamese Visual Tracking
Coarse-to-Fine Disentangling Transformer for Human-Object Interaction Detection
MDAN: Multi-level Dependent Attention Network for Visual Emotion Analysis
AdaSTE: An Adaptive Straight-Through Estimator to Train Binary Neural Networks
DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation
DTA: Physical Camouflage Attacks using Differentiable Transformation Network
Layer-wised Model Aggregation for Personalized Federated Learning
Video Swin Transformer
Online Continual Learning on a Contaminated Data Stream with Blurry Task Boundaries
General Incremental Learning with Domain-aware Categorical Representations
Crafting Better Contrastive Views for Siamese Representation Learning
A Style-aware Discriminator for Controllable Image Translation
BoosterNet: Improving Domain Generalization of Deep Neural Nets using Culpability-Ranked Features
A Unified Framework for Implicit Sinkhorn Differentiation
Brain-Supervised Image Editing
Neural Shape Mating: Self-Supervised Object Assembly with Adversarial Shape Priors
Multimodal Colored Point Cloud to Image Alignment
Graph-based Spatial Transformer with Memory Replay for Multi-future Pedestrian Trajectory Prediction
Multi-Objective Diverse Human Motion Prediction with Knowledge Distillation
Two Coupled Rejection Metrics Can Tell Adversarial Examples Apart
Autoregressive Image Generation using Residual Quantization
SGTR: End-to-end Scene Graph Generation with Transformer
Protecting Facial Privacy: Generating Adversarial Identity Masks via Style-robust Makeup Transfer
PPDL: Predicate Probability Distribution based Loss for Unbiased Scene Graph Generation
Localized Adversarial Domain Generalization
Patch-level Representation Learning for Self-supervised Vision Transformers
KNN Local Attention for Image Restoration
Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation
PILC: Practical Image Lossless Compression with an End-to-end GPU Oriented Neural Framework
DAD-3DHeads: A Large-scale Dense, Accurate and Diverse Dataset for 3D Dense Head Alignment from a Single Image
Is Mapping Necessary for Realistic PointGoal Navigation?
Cross-Domain Correlation Distillation for Unsupervised Domain Adaptation in Nighttime Semantic Segmentation
LiT: Zero-Shot Transfer with Locked-image text Tuning
Scaling Vision Transformers
Spatial Commonsense Graph for Object Localisation in Partial Scenes
Trajectory Optimization for Physics-Based Reconstruction of 3D Human Pose from Monocular Video
3MASSIV: Multilingual, Multimodal and Multi-Aspect dataset of Social Media Short Videos
Upright-Net: Learning Upright Orientation for 3D Point Cloud
DAIR-V2X: A Large-Scale Dataset for Vehicle-Infrastructure Cooperative 3D Object Detection
Differentiable Dynamics for Articulated 3D Human Motion Reconstruction
Clean Implicit 3D Structure from Noisy 2D STEM Images
MPC: Multi-view Probabilistic Clustering
Node-aligned Graph Convolutional Network for Whole-slide Image Representation and Classification
Multidimensional Belief Quantification for Label-Efficient Meta-Learning
Bayesian Nonparametric Submodular Video Partition for Robust Anomaly Detection
Uni6D: A Unified CNN Framework without Projection Breakdown in 6D Pose Estimation
Exploring Patch-wise Semantic Relation for Contrastive Learning in Image-to-Image Translation Tasks
Enabling Equivariance for Arbitrary Lie Groups
Multi-Scale Memory-Based Video Deblurring
Privacy Preserving Partial Localization
Towards Robust and Reproducible Active Learning using Neural Networks
Marginal Contrastive Correspondence for Exemplar-based Image Translation
TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repeated Action Counting
Training High-Performance Low-Latency Spiking Neural Networks by Differentiation on Spike Representation
FaceFormer: Speech-Driven 3D Facial Animation with Transformers
LARGE: Latent-Based Regression Through GAN Semantics
TransVPR: Transformer-Based Place Recognition with Multi-Level Attention Aggregation
AR-NeRF: Unsupervised Learning of Depth and Defocus Effects from Natural Images with Aperture Rendering Neural Radiance Fields
CAT-Det: Contrastively Augmented Transformer for Multi-modal 3D Object Detection
SASIC: Stereo Image Compression with Latent Shifts and Stereo Attention
Controllable Animation of Fluid Elements in Still Images
Revisiting BatchNorm's Learnable Affines in Few-Shot Transfer Learning
Learning Graph Regularisation for Guided Super-Resolution
Topology Preserving Local Road Network Estimation from Single Onboard Camera Image
Video-Text Representation Learning via Differentiable Weak Temporal Alignment
BppAttack: Stealthy and Efficient Trojan Attacks against Deep Neural Networks via Image Quantization and Contrastive Adversarial Learning
Face2Exp: Combating Data Biases for Facial Expression Recognition
Leveraging Equivariant Features for Absolute Pose Regression
Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut
Multi-View Depth Estimation by Fusing Single-View Depth Probability with Multi-View Geometry
ZZ-Net: A Universal Rotation Equivariant Architecture for 2D Point Clouds
Interactive Disentanglement: Learning Concepts by Interacting with their Prototype Representations
Incremental Learning in Semantic Segmentation from Image Labels
Complex Backdoor Detection by Symmetric Feature Differencing
Constrained Few-shot Class-incremental Learning
HyperSegNAS: Bridging One-Shot Neural Architecture Search with 3D Medical Image Segmentation using HyperNet
Amodal Panoptic Segmentation
Not Just Selection, but Exploration: Online Class-Incremental Continual Learning via Dual View Consistency
Coarse-to-Fine Q-attention: Efficient Learning for Visual Robotic Manipulation via Discretisation
Learning ABCs: Approximate Bijective Correspondence for isolating factors of variation
Pin the Memory: Learning to Generalize Semantic Segmentation
Long-tailed Visual Recognition via Gaussian Clouded Logit Adjustment
Knowledge distillation: A good teacher is patient and consistent
Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language
Searching the Deployable Convolution Neural Networks for GPUs
MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing
Condensing CNNs with Partial Differential Equations
Adaptive Early-Learning Correction for Segmentation from Noisy Annotations
Bounded Adversarial Attack on Deep Content Features
Towards Driving-Oriented Metric for Lane Detection Models
Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness
Better Trigger Inversion Optimization in Backdoor Scanning
Leveling Down in Computer Vision: Pareto Inefficiencies in Fair Deep Classifiers
Towards Understanding and Simplifying MoCo: Dual Temperature Helps Contrastive Learning without Many Negative Samples
Smooth Maximum Unit: Smooth Activation Function for Deep Networks using Smoothing Maximum Technique
Text-to-Image Synthesis based on Object-Guided Joint-Decoding Transformer
Image Segmentation Using Text and Image Prompts
Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation
Vision-Language Pre-Training with Triple Contrastive Learning
Temporal Context Matters: Enhancing Single Image Prediction with Disease Progression Representations
Globetrotter: Connecting Languages by Connecting Images
Single-Stage 3D Geometry-Preserving Depth Estimation Model Training on Dataset Mixtures with Uncalibrated Stereo Data
It’s Time for Artistic Correspondence in Music and Video
Equivariant Point Set Analysis via Learning Orientations for Message Passing
KeyTr: Keypoint Transporter for 3D Reconstruction of Deformable Objects in Videos
P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision
GlideNet: Global, Local and Intrinsic based Dense Embedding NETwork for Multi-category Attributes Prediction
MatchFAME: Fast, Accurate and Memory-Efficient Multi-Object Matching
Neural Emotion Director: Speech-preserving semantic control of facial expressions in “in-the-wild” videos
Id-Free Person Similarity Learning
Alleviating Emotional bias in Affective Image Captioning by Contrastive Data Collection
A study on the distribution of social biases in self-supervised learning visual models
Motron: Multimodal Probabilistic Human Motion Forecasting
Gaussian Process Modeling of Approximate Inference Errors for Variational Autoencoders
Real-time hyperspectral imaging in hardware via trained metasurface encoders
SmartPortraits: Depth Powered Handheld Smartphone Dataset of Human Portraits for State Estimation, Reconstruction and Synthesis
Improving Segmentation of the Inferior Alveolar Nerve through Deep Label Propagation
SLIC: Self-Supervised Learning with Iterative Clustering for Human Action Videos
Self-supervised Spatial Reasoning on Multi-View Line Drawings
Contrastive Test-Time Adaptation
Why Discard if You can Recycle?: A Recycling Max Pooling Module for 3D Point Cloud Analysis
Do learned representations respect causal relationships?
Zero-Query Transfer Attacks on Context-Aware Object Detectors
Training Quantised Neural Networks with STE Variants: the Additive Noise Annealing Algorithm
Contrastive Dual Gating: Learning Sparse Features With Contrastive Learning
Efficient Maximal Coding Rate Reduction by Variational Forms
Everything at Once - Multi-modal Fusion Transformer for Video Retrieval
Towards Efficient and Scalable Sharpness-Aware Minimization
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
Merry Go Round: Rotate a Frame and Fool a DNN
Label-Only Model Inversion Attacks via Boundary Repulsion
Style-Structure Disentangled Features and Normalizing Flows for Diverse Icon Colorization
How Much More Data Do I Need? Estimating Requirements For Downstream Tasks
A sampling-based approach for efficient clustering in large datasets
Deep Equilibrium Optical Flow Estimation
Polarity Sampling: Quality and Diversity Control of Pre-Trained Generative Networks via Singular Values
Multi-label Iterated Learning for Image Classification with Label Ambiguity
Cross-modal Map Learning for Vision and Language Navigation
Learning with Neighbor Consistency for Noisy Labels
Measuring Compositional Consistency for Video Question Answering
Failure Modes of Domain Generalization Algorithms
AutoRF: Learning 3D Object Radiance Fields from Single View Observations
A Unified Model for Line Projections in Catadioptric Cameras
OrphicX: A Causality-Inspired Latent Variable Model for Interpreting Graph Neural Networks
Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning
Cluster-guided Image Synthesis with Unconditional Models
Self-supervised object detection from audio-visual correspondence
Clipped Hyperbolic Classifiers Are Super-Hyperbolic Classifiers
Local Learning Matters: Rethinking Data Heterogeneity in Federated Learning
Weakly-Supervised Generation and Grounding of Visual Descriptions with Conditional Generative Models
How much does input data type impact final face model accuracy?
Certified Patch Robustness via Smoothed Vision Transformers
PubTables-1M: Towards comprehensive table extraction from unstructured documents
Fine-tuning Image Transformers using Learnable Memory
GuideFormer: Transformers for Image Guided Depth Completion
Motion-Adjustable Neural Implicit Video Representation
LiDARCap: Long-range Marker-less 3D Human Motion Capture with LiDAR Point Clouds
Multi-modal Alignment using Representation Codebook
NOC-REK: Novel Object Captioning with Retrieved Vocabulary from External Knowledge
Investigating Top-k White-Box and Transferable Black-box Attack
GPU-Based Homotopy Continuation for Minimal Problems in Computer Vision
On the Instability of Relative Pose Estimation and RANSAC’s Role
Dual Task Learning by Leveraging Both Dense Correspondence and Mis-Correspondence for Robust Change Detection With Imperfect Matches
M3L: Language-based Video Editing via Multi-Modal Multi-Level Transformers
Dynamic Scene Graph Generation via Anticipatory Pre-training
ScanQA: 3D Question Answering for Spatial Scene Understanding
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures
Large Images as Long Documents: Hierarchical ViTs with Self-Supervised Pretraining in Gigapixel Image Pyramids
Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection
On Guiding Visual Attention with Language Specification
OnePose: One-Shot Object Pose Estimation without CAD Models
Thin-Plate Spline Motion Model for Image Animation
PokeBNN: A Binary Pursuit of Lightweight Accuracy
Semi-Supervised Few-shot Learning via Multi-Factor Clustering
FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback
CLIPstyler: Image Style Transfer with a Single Text Condition
Ithaca365: Dataset and Driving Perception under Repeated and Challenging Weather Conditions
Out-of-distribution Generalization with Causal Invariant Transformations
Zero-Shot Text-Guided Object Generation with Dream Fields
Noise Distribution Adaptive Self-Supervised Image Denoising using Tweedie Distribution and Score Matching
TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization
NICGSlowDown: Evaluating the Efficiency Robustness of Neural Image Caption Generation Models
Deep Unlearning via Randomized Conditionally Independent Hessians
Multi-Modal Dynamic Graph Transformer for Visual Grounding
Propagation Regularizer for Semi-supervised Learning with Extremely Scarce Labeled Samples
Discrete Wasserstein Distributional Matching for Quantization in Image Hashing
Robust fine-tuning of zero-shot models
Probabilistic Representations for Video Contrastive Learning
Come-Closer-Diffuse-Faster: Accelerating Conditional Diffusion Models for Inverse Problems through Stochastic Contraction
Fine-Grained Object Classification via Self-Supervised Pose Alignment
One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones
A Framework for Learning Ante-hoc Explainable Models via Concepts
Retrieval Augmented Classification for Long Tail Visual Recognition
Deep Spectral Methods: A Surprisingly Strong Baseline for Unsupervised Semantic Segmentation and Localization
Learning Video Representations of Human Motion from Synthetic Data
Exploiting Pseudo Labels in a Self-Supervised Learning Framework for Improved Monocular Depth Estimation
Efficient Deep Embedded Subspace Clustering
Local-Adaptive Face Recognition via Graph-based Meta-Clustering and Regularized Adaptation
GenDR: A Generalized Differentiable Renderer
Fingerprinting Deep Neural Networks Globally via Universal Adversarial Perturbations
Learning Multiple Adverse Weather Removal via Two-stage Knowledge Learning and Multi-contrastive Regularization: Toward a Unified Model