This paper tackles the high computational/space complexity associated with multi-head self-attention (MHSA) in vanilla vision transformers. To this end, we propose hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion. Specifically, we first divide the input image into patches as commonly done, and each patch is viewed as a token. Then, the proposed H-MHSA learns token relationships within local patches, serving as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models the global dependencies for the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Since we only calculate attention for a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of hierarchical-attention-based transformer networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. HAT-Net thus provides a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/HAT-Net.
Funding: supported by the A*STAR Career Development Fund, Singapore (No. C233312006).
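To make the hierarchical attention computation described above concrete, here is a minimal PyTorch-style sketch of the idea (local window attention over small patches, global attention over merged tokens, then aggregation); the module layout, window size, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of hierarchical self-attention: local attention inside small
# windows, then global attention over merged (pooled) window tokens, then aggregation.
# Shapes, window size, and hyper-parameters are assumptions, not the paper's code.
import torch
import torch.nn as nn

class HierarchicalAttentionSketch(nn.Module):
    def __init__(self, dim=64, num_heads=4, grid=4):
        super().__init__()
        self.grid = grid  # each local window covers grid x grid tokens
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, H, W, C) token map; H and W divisible by grid
        B, H, W, C = x.shape
        g = self.grid
        # 1) local attention inside each g x g window
        win = x.reshape(B, H // g, g, W // g, g, C).permute(0, 1, 3, 2, 4, 5)
        win = win.reshape(B * (H // g) * (W // g), g * g, C)
        local, _ = self.local_attn(win, win, win)
        local = local.reshape(B, H // g, W // g, g, g, C).permute(0, 1, 3, 2, 4, 5)
        local = local.reshape(B, H, W, C)
        # 2) merge each window into a single token and attend globally
        merged = local.reshape(B, H // g, g, W // g, g, C).mean(dim=(2, 4))
        merged = merged.reshape(B, (H // g) * (W // g), C)
        glob, _ = self.global_attn(merged, merged, merged)
        glob = glob.reshape(B, H // g, W // g, C)
        # 3) broadcast global features back to full resolution and aggregate
        glob = glob.repeat_interleave(g, dim=1).repeat_interleave(g, dim=2)
        return self.proj(local + glob)

tokens = torch.randn(2, 16, 16, 64)                  # toy 16 x 16 token map
print(HierarchicalAttentionSketch()(tokens).shape)   # torch.Size([2, 16, 16, 64])
```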
Audio-visual wake word spotting is a challenging multi-modal task that exploits visual information of lip motion patterns to supplement acoustic speech and improve overall detection performance. However, most audio-visual wake word spotting models are only suitable for simple single-speaker scenarios and have high computational complexity. Further development is hindered by complex multi-person scenarios and computational limitations in mobile environments. In this paper, a novel audio-visual model is proposed for on-device multi-person wake word spotting. Firstly, an attention-based audio-visual voice activity detection module is presented, which generates an attention score matrix of audio and visual representations to derive the active speaker representation. Secondly, a knowledge distillation method is introduced to transfer knowledge from the large model to the on-device model to control the size of our model. Moreover, a new audio-visual dataset, PKU-KWS, is collected for sentence-level multi-person wake word spotting. Experimental results on the PKU-KWS dataset show that this approach outperforms the previous state-of-the-art methods.
Funding: supported by the National Key R&D Program of China (No. 2020AAA0108904) and the Science and Technology Plan of Shenzhen (No. JCYJ20200109140410340).
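The knowledge-distillation step mentioned above can be illustrated with a standard temperature-softened distillation objective; the loss form, temperature, and weighting below are generic assumptions rather than the paper's exact recipe.

```python
# Generic knowledge-distillation loss (large teacher -> on-device student); the
# temperature and weighting are illustrative assumptions, not the paper's settings.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # soft targets: match the teacher's temperature-softened distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # hard targets: ordinary cross-entropy against the wake-word labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

student = torch.randn(8, 2)            # toy logits: wake word present / absent
teacher = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
print(distillation_loss(student, teacher, labels).item())
```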
This paper presents a new method for text detection, location, and binarization from natural scenes. Several morphological steps are used to detect the general position of the text, including English, Chinese, and Japanese characters. Next, bounding boxes are processed by a new “Expand, Break and Merge” (EBM) method to get the precise text areas. Finally, text is binarized by a hybrid method based on Otsu and Niblack. This new approach can extract different kinds of text from complicated natural scenes. It is insensitive to noise, distortion, and text orientation. It also performs well on extracting text of various sizes.
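As a rough illustration of a hybrid Otsu/Niblack binarization, the sketch below blends a global Otsu threshold with Niblack's local threshold; the blending rule and parameters are assumptions and not necessarily the paper's formulation.

```python
# Sketch of a hybrid binarization: Niblack's local threshold blended with a global
# Otsu threshold. The combination rule and parameters are illustrative assumptions.
import numpy as np
from skimage.filters import threshold_otsu, threshold_niblack

def hybrid_binarize(gray, window_size=25, k=0.2, blend=0.5):
    """gray: 2-D grayscale array; returns a boolean text mask."""
    t_otsu = threshold_otsu(gray)                                       # one global threshold
    t_niblack = threshold_niblack(gray, window_size=window_size, k=k)   # per-pixel local threshold
    threshold = blend * t_otsu + (1.0 - blend) * t_niblack              # mix global and local cues
    return gray > threshold

if __name__ == "__main__":
    img = (np.random.rand(64, 64) * 255).astype(np.uint8)
    mask = hybrid_binarize(img)
    print(mask.shape, mask.dtype)
```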
Power-line networks are designed to deliver electricity. They reach most domiciles and other buildings nowadays, so most people have access to them. On the other hand, the backbone for communications networks is not available in all countries, especially developing ones. Constructing this backbone may require high cost and changes to the network design. If data can be transmitted over power-line networks, a considerable saving in cost and time can be achieved. In Egypt, the infrastructure is not always available for constructing a communications network backbone, because many buildings were designed before the need for such backbones arose. In this paper, we overcome this problem by designing a reliable power-line modem that operates safely on the low-voltage grid. The modem is based on the direct sequence spread spectrum technique. It uses the mains zero crossing as an efficient way to synchronize the transmitter and the receiver. The modem takes into account the problems of the power line, including noise, attenuation, and impedance mismatch.
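The direct sequence spread spectrum principle behind the modem can be sketched in a few lines: each data bit is multiplied by a pseudo-noise chip sequence at the transmitter and correlated with the same sequence at the receiver. The chip length and sequence below are arbitrary, and perfect bit alignment (which the real modem derives from the mains zero crossing) is simply assumed.

```python
# Toy direct-sequence spread-spectrum (DSSS) spreading/despreading; chip length and
# PN sequence are arbitrary, and perfect synchronization (as obtained from the mains
# zero crossing in the modem) is assumed so the receiver knows where each bit starts.
import numpy as np

rng = np.random.default_rng(0)
CHIPS = 31                                   # chips per data bit (assumed)
pn = rng.choice([-1.0, 1.0], size=CHIPS)     # pseudo-noise sequence shared by TX and RX

def spread(bits):
    symbols = 2.0 * np.asarray(bits) - 1.0   # map {0,1} -> {-1,+1}
    return (symbols[:, None] * pn).ravel()   # each bit becomes CHIPS chips

def despread(signal):
    chunks = signal.reshape(-1, CHIPS)       # re-align on bit boundaries
    correlation = chunks @ pn                # correlate with the PN sequence
    return (correlation > 0).astype(int)

bits = rng.integers(0, 2, size=16)
noisy = spread(bits) + rng.normal(scale=2.0, size=16 * CHIPS)  # additive channel noise
print((despread(noisy) == bits).all())       # usually True thanks to the spreading gain
```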
Precise polyp segmentation is vital for the early diagnosis and prevention of colorectal cancer (CRC) in clinical practice. However, due to scale variation and blurry polyp boundaries, it is still a challenging task to achieve satisfactory segmentation performance for polyps with different scales and shapes. In this study, we present a novel edge-aware feature aggregation network (EFA-Net) for polyp segmentation, which makes full use of cross-level and multi-scale features to enhance the performance of polyp segmentation. Specifically, we first present an edge-aware guidance module (EGM) to combine low-level features with high-level features to learn an edge-enhanced feature, which is incorporated into each decoder unit using a layer-by-layer strategy. Besides, a scale-aware convolution module (SCM) is proposed to learn scale-aware features by using dilated convolutions with different ratios, in order to effectively deal with scale variation. Further, a cross-level fusion module (CFM) is proposed to effectively integrate cross-level features, exploiting both local and global contextual information. Finally, the outputs of CFMs are adaptively weighted by the learned edge-aware feature and then used to produce multiple side-output segmentation maps. Experimental results on five widely adopted colonoscopy datasets show that our EFA-Net outperforms state-of-the-art polyp segmentation methods in terms of generalization and effectiveness. Our implementation code and segmentation maps will be made publicly available at https://github.com/taozh2017/EFANet.
Funding: supported in part by the National Natural Science Foundation of China (Nos. 62172228, 62201263, 62106043, and 62201265).
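The scale-aware convolution idea (parallel dilated convolutions with different ratios) can be sketched as follows; the dilation rates, channel counts, and fusion by summation are assumptions made for illustration.

```python
# Illustrative scale-aware convolution block: parallel 3x3 convolutions with different
# dilation rates, fused by summation. Rates and channel sizes are assumptions.
import torch
import torch.nn as nn

class ScaleAwareConvSketch(nn.Module):
    def __init__(self, channels=64, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        # each branch sees a different receptive field; summation keeps the channel count
        return self.fuse(sum(branch(x) for branch in self.branches))

feat = torch.randn(1, 64, 44, 44)
print(ScaleAwareConvSketch()(feat).shape)  # torch.Size([1, 64, 44, 44])
```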
Transformers have recently led to encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) by adding three designs: (i) a linear complexity attention layer, (ii) an overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and provides significant improvements on fundamental vision tasks such as classification, detection, and segmentation. In particular, PVT v2 achieves comparable or better performance than recent work such as the Swin transformer. We hope this work will facilitate state-of-the-art transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
Funding: supported by the National Natural Science Foundation of China under Grant Nos. 61672273 and 61832008, the Science Foundation for Distinguished Young Scholars of Jiangsu under Grant No. BK20160021, the Postdoctoral Innovative Talent Support Program of China under Grant Nos. BX20200168 and 2020M681608, and the General Research Fund of Hong Kong under Grant No. 27208720.
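One common way to obtain a linear-complexity attention layer is to let every query attend to a fixed-size, spatially pooled set of keys and values; the sketch below follows that pattern, with the pool size and dimensions chosen only for illustration rather than taken from the paper.

```python
# Sketch of linear spatial-reduction attention: queries attend to keys/values computed
# from a pooled, fixed-size version of the feature map, so cost grows linearly with
# the number of tokens. Pool size and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class LinearSRAttentionSketch(nn.Module):
    def __init__(self, dim=64, num_heads=4, pool_size=7):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pool_size)   # reduce K/V to pool_size^2 tokens
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                             # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = x.flatten(2).transpose(1, 2)              # (B, H*W, C): one query per pixel
        kv = self.pool(x).flatten(2).transpose(1, 2)  # (B, pool_size^2, C)
        out, _ = self.attn(q, kv, kv)                 # attention cost: O(H*W * pool_size^2)
        return out.transpose(1, 2).reshape(B, C, H, W)

x = torch.randn(1, 64, 32, 32)
print(LinearSRAttentionSketch()(x).shape)  # torch.Size([1, 64, 32, 32])
```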
We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply utilize the vision transformer architecture to replace the bidirectional encoder representations from Transformers (BERT) in the pre-training model, making MVLT the first end-to-end framework for the fashion domain. Besides, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show obvious improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT.
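The masked image reconstruction objective can be illustrated with a toy example that masks a fraction of patch tokens and regresses the missing patches from the visible ones; the mask ratio, patch size, and the tiny encoder are assumptions, not MVLT's actual configuration.

```python
# Toy masked-image-reconstruction objective: hide a fraction of patch tokens and
# regress the missing patches. Mask ratio, patch size, and the small encoder are
# illustrative assumptions rather than the paper's configuration.
import torch
import torch.nn as nn

patch_dim, num_patches, mask_ratio = 48, 64, 0.6
encoder_layer = nn.TransformerEncoderLayer(d_model=patch_dim, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
decoder = nn.Linear(patch_dim, patch_dim)             # predict raw patch values
mask_token = nn.Parameter(torch.zeros(patch_dim))     # learnable placeholder for hidden patches

patches = torch.randn(2, num_patches, patch_dim)      # flattened image patches
mask = torch.rand(2, num_patches) < mask_ratio        # True where the patch is hidden
corrupted = torch.where(mask.unsqueeze(-1), mask_token.expand_as(patches), patches)

recon = decoder(encoder(corrupted))                    # reconstruct every patch
loss = ((recon - patches)[mask] ** 2).mean()           # supervise only the masked positions
print(loss.item())
```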
Most polyp segmentation methods use convolutional neural networks (CNNs) as their backbone, leading to two key issues when exchanging information between the encoder and decoder: (1) taking into account the differences in contribution between different-level features, and (2) designing an effective mechanism for fusing these features. Unlike existing CNN-based methods, we adopt a transformer encoder, which learns more powerful and robust representations. In addition, considering the image acquisition influence and elusive properties of polyps, we introduce three modules, including a cascaded fusion module (CFM), a camouflage identification module (CIM), and a similarity aggregation module (SAM). Among these, the CFM is used to collect the semantic and location information of polyps from high-level features; the CIM is applied to capture polyp information disguised in low-level features; and the SAM extends the pixel features of the polyp area with high-level semantic position information to the entire polyp area, thereby effectively fusing cross-level features. The proposed model, named Polyp-PVT, effectively suppresses noise in the features and significantly improves their expressive capabilities. Extensive experiments on five widely adopted datasets show that the proposed model is more robust to various challenging situations (e.g., appearance changes, small objects, and rotation) than existing representative methods. The proposed model is available at https://github.com/DengPingFan/Polyp-PVT.
Creating large-scale and well-annotated datasets to train AI algorithms is crucial for automated tumor detection and localization. However, with limited resources, it is challenging to determine the best type of annotation when annotating massive amounts of unlabeled data. To address this issue, we focus on polyps in colonoscopy videos and pancreatic tumors in abdominal CT scans; both applications require significant effort and time for pixel-wise annotation due to the high-dimensional nature of the data, involving either temporal or spatial dimensions. In this paper, we develop a new annotation strategy, termed Drag&Drop, which simplifies the annotation process to drag and drop. This annotation strategy is more efficient, particularly for temporal and volumetric imaging, than other types of annotation, such as per-pixel, bounding boxes, scribbles, ellipses, and points. Furthermore, to exploit our Drag&Drop annotations, we develop a novel weakly supervised learning method based on the watershed algorithm. Experimental results show that our method achieves better detection and localization performance than alternative weak annotations and, more importantly, achieves similar performance to that trained on detailed per-pixel annotations. Interestingly, we find that, with limited resources, allocating weak annotations from a diverse patient population can foster models more robust to unseen images than allocating per-pixel annotations for a small set of images. In summary, this research proposes an efficient annotation strategy for tumor detection and localization that is less accurate than per-pixel annotation but useful for creating large-scale datasets for screening tumors in various medical modalities.
Funding: supported by the Lustgarten Foundation for Pancreatic Cancer Research and the Patrick J. McGovern Foundation Award.
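To illustrate how a weak, box-like annotation might drive the watershed algorithm mentioned above, here is a rough sketch in which a user-supplied extent provides foreground and background markers for scikit-image's watershed; the marker construction is a simplification, not the paper's actual procedure.

```python
# Rough sketch: turn a drag-and-drop style box into watershed markers and segment.
# Marker placement and the use of the gradient image are simplifying assumptions.
import numpy as np
from scipy import ndimage as ndi
from skimage.filters import sobel
from skimage.segmentation import watershed

def segment_from_box(gray, box):
    """gray: 2-D image; box: (r0, c0, r1, c1) roughly enclosing the lesion."""
    r0, c0, r1, c1 = box
    markers = np.zeros(gray.shape, dtype=np.int32)
    markers[0, :] = markers[-1, :] = markers[:, 0] = markers[:, -1] = 1   # background seeds
    rc, cc = (r0 + r1) // 2, (c0 + c1) // 2
    markers[rc - 1 : rc + 2, cc - 1 : cc + 2] = 2                         # foreground seed at box center
    labels = watershed(sobel(gray), markers)                              # flood the gradient image
    return labels == 2

img = ndi.gaussian_filter(np.random.rand(96, 96), 3)
mask = segment_from_box(img, (30, 30, 66, 66))
print(mask.sum() > 0)
```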
This paper investigates the role of global context in crowd counting. Specifically, a pure transformer is used to extract features with global information from overlapping image patches. Inspired by classification, we add a context token to the input sequence, to facilitate information exchange with the tokens corresponding to image patches throughout the transformer layers. Since transformers do not explicitly model the tried-and-tested channel-wise interactions, we propose a token-attention module (TAM) to recalibrate encoded features through channel-wise attention informed by the context token. Beyond that, the total person count of the image is predicted through a regression-token module (RTM). Extensive experiments on various datasets, including ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU, demonstrate that the proposed context extraction techniques can significantly improve performance over the baselines.
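A minimal sketch of the channel recalibration described above: the context token is mapped to per-channel gates that rescale the patch-token features. The layer sizes and gating form are assumptions, not the paper's exact module.

```python
# Sketch of a token-attention module: the context token produces per-channel gates
# that recalibrate the patch-token features. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TokenAttentionSketch(nn.Module):
    def __init__(self, dim=256, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, patch_tokens, context_token):
        # patch_tokens: (B, N, C), context_token: (B, C)
        weights = self.gate(context_token).unsqueeze(1)   # (B, 1, C) channel-wise attention
        return patch_tokens * weights                     # recalibrated features

tokens, ctx = torch.randn(2, 196, 256), torch.randn(2, 256)
print(TokenAttentionSketch()(tokens, ctx).shape)          # torch.Size([2, 196, 256])
```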
While recent years have witnessed a dramatic upsurge in exploiting deep neural networks for image denoising, existing methods mostly rely on simple noise assumptions, such as additive white Gaussian noise (AWGN), JPEG compression noise, and camera sensor noise, and a general-purpose blind denoising method for real images remains unsolved. In this paper, we attempt to solve this problem from the perspectives of network architecture design and training data synthesis. Specifically, for the network architecture design, we propose a swin-conv block to incorporate the local modeling ability of the residual convolutional layer and the non-local modeling ability of the Swin transformer block, and then plug it as the main building block into the widely used image-to-image translation UNet architecture. For the training data synthesis, we design a practical noise degradation model which takes into consideration different kinds of noise (including Gaussian, Poisson, speckle, JPEG compression, and processed camera sensor noises) and resizing, and also involves a random shuffle strategy and a double degradation strategy. Extensive experiments on AWGN removal and real image denoising demonstrate that the new network architecture design achieves state-of-the-art performance and the new degradation model can help to significantly improve practicability. We believe our work can provide useful insights into current denoising research. The source code is available at https://github.com/cszn/SCUNet.
Funding: partly supported by the ETH Zürich Fund (OK) and by Huawei grants.
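The spirit of the degradation pipeline (multiple noise types applied in a randomly shuffled order) can be sketched as follows; the parameter ranges are assumed, and the paper's JPEG compression, resizing, and camera-sensor steps are omitted for brevity.

```python
# Toy degradation pipeline in the spirit described above: Gaussian, Poisson, and
# speckle noise applied in a randomly shuffled order. Parameter ranges are assumed,
# and the paper's JPEG/resizing/camera-noise steps are omitted for brevity.
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian(img):
    return img + rng.normal(scale=rng.uniform(0.01, 0.1), size=img.shape)

def add_poisson(img):
    peak = rng.uniform(30, 300)                       # photon count controls noise level
    return rng.poisson(np.clip(img, 0, 1) * peak) / peak

def add_speckle(img):
    return img * (1.0 + rng.normal(scale=rng.uniform(0.01, 0.1), size=img.shape))

def degrade(img):
    ops = [add_gaussian, add_poisson, add_speckle]
    rng.shuffle(ops)                                  # random shuffle strategy
    for op in ops:
        if rng.random() < 0.7:                        # each degradation applied with some probability
            img = op(img)
    return np.clip(img, 0.0, 1.0)

clean = rng.random((64, 64, 3))
noisy = degrade(clean)
print(float(np.abs(noisy - clean).mean()))
```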
This paper presents a novel approach for human identification at a distance using gait recognition. Recognition of a person from their gait is a biometric of increasing interest. The proposed work introduces a nonlinear machine learning method, kernel Principal Component Analysis (PCA), to extract gait features from silhouettes for individual recognition. The binarized silhouette of a moving object is first represented by four 1-D signals, the basic image features called distance vectors. A Fourier transform is performed to achieve translation invariance for the gait patterns accumulated from silhouette sequences extracted under different conditions. Kernel PCA is then used to extract higher-order relations among the gait patterns for subsequent recognition. A fusion strategy is finally executed to produce the final decision. The experiments are carried out on the CMU and USF gait databases, and results are presented for different training gait cycles.
Funding: supported by Karadeniz Technical University under Grant No. KTU-2004.112.009.001.
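The feature pipeline (distance-vector signals, FFT magnitudes for translation invariance, kernel PCA, and a nearest-neighbour decision) can be sketched with scikit-learn on synthetic data; the signal length, kernel, and component count are illustrative assumptions.

```python
# Sketch of the recognition pipeline: translation-invariant FFT magnitudes of the
# distance-vector signals, kernel PCA features, and a nearest-neighbour decision.
# Signal length, kernel, and component count are illustrative assumptions.
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_subjects, n_cycles, signal_len = 5, 6, 128
X = rng.random((n_subjects * n_cycles, 4 * signal_len))     # four 1-D distance vectors per cycle
y = np.repeat(np.arange(n_subjects), n_cycles)

X_fft = np.abs(np.fft.rfft(X, axis=1))                      # magnitude spectrum: translation invariant

kpca = KernelPCA(n_components=16, kernel="rbf", gamma=1e-3) # nonlinear feature extraction
features = kpca.fit_transform(X_fft)

clf = KNeighborsClassifier(n_neighbors=1).fit(features, y)  # simple nearest-neighbour matcher
print(clf.score(features, y))                               # 1.0 on the (toy) training data
```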
Salient object detection (SOD) is a long-standing research topic in computer vision, with increasing interest over the past decade. Since light fields record comprehensive information about natural scenes that benefits SOD in a number of ways, using light field inputs to improve saliency detection over conventional RGB inputs is an emerging trend. This paper provides the first comprehensive review and benchmark for light field SOD, which has long been lacking in the saliency community. Firstly, we introduce light fields, including their theory and data forms, and then review existing studies on light field SOD, covering ten traditional models, seven deep learning-based models, a comparative study, and a brief review. Existing datasets for light field SOD are also summarized. Secondly, we benchmark nine representative light field SOD models together with several cutting-edge RGB-D SOD models on four widely used light field datasets, providing insightful discussions and analyses, including a comparison between light field SOD and RGB-D SOD models. Due to the inconsistency of current datasets, we further generate complete data and supplement focal stacks, depth maps, and multi-view images for them, making them consistent and uniform. Our supplemental data make a universal benchmark possible. Lastly, because light field SOD is a specialised problem with diverse data representations and a high dependency on acquisition hardware, it differs greatly from other saliency detection tasks. We provide nine observations on challenges and future directions, and outline several open issues. All the materials, including models, datasets, benchmarking results, and supplemented light field datasets, are publicly available at https://github.com/kerenfu/LFSOD-Survey.
Funding: supported by the National Natural Science Foundation of China (Nos. 62176169, 61703077, 62172228, and 61773270) and the SCU-Luzhou Municipal People's Government Strategic Cooperation Project (No. 2020CDLZ-10).
Google's Bard has emerged as a formidable competitor to OpenAI's ChatGPT in the field of conversational AI. Notably, Bard has recently been updated to handle visual inputs alongside text prompts during conversations. Given Bard's impressive track record in handling textual inputs, we explore its capabilities in understanding and interpreting visual data (images) conditioned on text questions. This exploration holds the potential to unveil new insights and challenges for Bard and other forthcoming multi-modal generative models, especially in addressing complex computer vision problems that demand accurate visual and language understanding. Specifically, in this study, we focus on 15 diverse task scenarios encompassing regular, camouflaged, medical, underwater, and remote sensing data to comprehensively evaluate Bard's performance. Our primary finding is that Bard still struggles in these vision scenarios, highlighting the significant gap in vision-based understanding that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, leading to enhanced capabilities in comprehending and interpreting fine-grained visual data. Our project is released at https://github.com/htqin/GoogleBard-VisUnderstand.
Salient object detection (SOD) in RGB and depth images has attracted increasing research interest. Existing RGB-D SOD models usually adopt fusion strategies to learn a shared representation from the RGB and depth modalities, while few methods explicitly consider how to preserve modality-specific characteristics. In this study, we propose a novel framework, the specificity-preserving network (SPNet), which improves SOD performance by exploring both the shared information and modality-specific properties. Specifically, we use two modality-specific networks and a shared learning network to generate individual and shared saliency prediction maps. To effectively fuse cross-modal features in the shared learning network, we propose a cross-enhanced integration module (CIM) and propagate the fused feature to the next layer to integrate cross-level information. Moreover, to capture rich complementary multi-modal information and boost SOD performance, we use a multi-modal feature aggregation (MFA) module to integrate the modality-specific features from each individual decoder into the shared decoder. By using skip connections between encoder and decoder layers, hierarchical features can be fully combined. Extensive experiments demonstrate that our SPNet outperforms cutting-edge approaches on six popular RGB-D SOD benchmarks and three camouflaged object detection benchmarks. The project is publicly available at https://github.com/taozh2017/SPNet.
Funding: supported in part by the National Natural Science Foundation of China under Grant No. 62172228, and in part by an Open Project of the Key Laboratory of System Control and Information Processing, Ministry of Education (Shanghai Jiao Tong University, No. Scip202102).
Previous video object segmentation approaches mainly focus on simplex solutions linking appearance and motion, limiting effective feature collaboration between these two cues. In this work, we study a novel and efficient full-duplex strategy network (FSNet) to address this issue, by considering a better mutual restraint scheme linking motion and appearance that allows the exploitation of cross-modal features in the fusion and decoding stages. Specifically, we introduce a relational cross-attention module (RCAM) to achieve bidirectional message propagation across embedding sub-spaces. To improve the model's robustness and update inconsistent features from the spatiotemporal embeddings, we adopt a bidirectional purification module after the RCAM. Extensive experiments on five popular benchmarks show that our FSNet is robust to various challenging scenarios (e.g., motion blur and occlusion) and compares well to leading methods both for video object segmentation and video salient object detection. The project is publicly available at https://github.com/GewelsJI/FSNet.
Funding: supported by the National Natural Science Foundation of China (Nos. 62176169, 61703077, and 62102207).
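The bidirectional (full-duplex) message passing between appearance and motion cues can be sketched as two cross-attention calls, one in each direction; the dimensions and the simple residual fusion are assumptions, not FSNet's exact module.

```python
# Sketch of bidirectional cross-attention between appearance and motion tokens: each
# modality queries the other, and both refined streams are returned. Dimensions and
# the residual fusion are assumptions, not the paper's exact module.
import torch
import torch.nn as nn

class BidirectionalCrossAttentionSketch(nn.Module):
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.app_from_motion = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.motion_from_app = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, appearance, motion):  # both: (B, N, C) token sequences
        app_out, _ = self.app_from_motion(appearance, motion, motion)       # appearance queries motion
        mot_out, _ = self.motion_from_app(motion, appearance, appearance)   # motion queries appearance
        return appearance + app_out, motion + mot_out                       # residual fusion

app, mot = torch.randn(2, 196, 64), torch.randn(2, 196, 64)
a, m = BidirectionalCrossAttentionSketch()(app, mot)
print(a.shape, m.shape)
```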
Interactive image segmentation (IIS) is an important technique for obtaining pixel-level annotations. In many cases, target objects share similar semantics. However, IIS methods neglect this connection and, in particular, the cues provided by representations of previously segmented objects, previous user interaction, and previous prediction masks, all of which can provide suitable priors for the current annotation. In this paper, we formulate a sequential interactive image segmentation (SIIS) task for minimizing user interaction when segmenting sequences of related images, and we provide a practical approach to this task using two pertinent designs. The first is a novel interaction mode. When annotating a new sample, our method can automatically propose an initial click based on previous annotations. This dramatically helps to reduce the interaction burden on the user. The second is an online optimization strategy, with the goal of providing semantic information when annotating specific targets, optimizing the model with dense supervision from previously labeled samples. Experiments demonstrate the effectiveness of regarding SIIS as a particular task, and of our methods for addressing it.
In this paper, we present a case study that performs unmanned aerial vehicle (UAV) based fine-scale 3D change detection and monitoring of the progressive collapse performance of a building during a demolition event. Multi-temporal oblique photogrammetry images are collected, with 3D point clouds generated at different stages of the demolition. The geometric accuracy of the generated point clouds has been evaluated against both airborne and terrestrial LiDAR point clouds, achieving an average distance of 12 cm and 16 cm for the roof and façade, respectively. We propose a hierarchical volumetric change detection framework that unifies multi-temporal UAV images for pose estimation (free of ground control points), reconstruction, and a coarse-to-fine 3D density change analysis. This work provides a solution capable of addressing change detection on full 3D time-series datasets where dramatic scene content changes are presented progressively. Our change detection results on the building demolition event have been evaluated against manually marked ground-truth changes and achieve an F-1 score varying from 0.78 to 0.92, with consistently high precision (0.92–0.99). Volumetric changes through the demolition progress are derived from the change detection and have been shown to favorably reflect the qualitative and quantitative building demolition progression.
Funding: supported by the National Science Foundation (grant number 2036193) and in part by the Office of Naval Research (grant numbers N00014-17-1-2928 and N00014-20-1-2141).
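The coarse 3D density-change idea can be illustrated by voxelizing two point-cloud epochs on a common grid and differencing the per-voxel counts; the voxel size and change threshold below are arbitrary.

```python
# Toy volumetric change detection: voxelize two point-cloud epochs on a shared grid
# and flag voxels whose point density changed. Voxel size and threshold are arbitrary.
import numpy as np

def voxel_counts(points, origin, voxel, dims):
    idx = np.floor((points - origin) / voxel).astype(int)
    keep = np.all((idx >= 0) & (idx < dims), axis=1)
    counts = np.zeros(dims, dtype=np.int32)
    np.add.at(counts, tuple(idx[keep].T), 1)
    return counts

rng = np.random.default_rng(0)
before = rng.uniform(0, 10, size=(100000, 3))   # epoch 1: points over the whole structure
after = before[before[:, 2] < 7.0]              # epoch 2: everything above z = 7 removed

origin, voxel, dims = np.zeros(3), 0.5, (20, 20, 20)
delta = voxel_counts(after, origin, voxel, dims) - voxel_counts(before, origin, voxel, dims)
changed = np.abs(delta) > 5                     # density-change threshold (arbitrary)
print(int(changed.sum()), "changed voxels")     # roughly the voxels with z >= 7
```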
This paper presents a wavelet-based kernel Principal Component Analysis (PCA) method that integrates the Daubechies wavelet representation of palm images with the kernel PCA method for palmprint recognition. Kernel PCA is a technique for nonlinear dimension reduction of data with an underlying nonlinear spatial structure. The intensity values of the palmprint image are first normalized using the mean and standard deviation. The palmprint is then transformed into the wavelet domain to decompose the palm image, and the lowest-resolution subband coefficients are chosen for palm representation. The kernel PCA method is then applied to extract nonlinear features from the subband coefficients. Finally, similarity measurement is accomplished using a weighted Euclidean distance-based nearest neighbor classifier. Experimental results on the PolyU Palmprint Database demonstrate that the proposed approach achieves highly competitive performance with respect to published palmprint recognition approaches.
Funding: fully supported by the TUBITAK Research Project under Grant No. 107E212.
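The recognition pipeline (intensity normalization, Daubechies wavelet decomposition, kernel PCA features, and weighted Euclidean nearest-neighbour matching) can be sketched on synthetic data as below; the wavelet level, kernel, and weighting are assumptions.

```python
# Sketch of the palmprint pipeline: normalize intensities, take the lowest-resolution
# Daubechies subband, extract kernel-PCA features, and match with a weighted
# Euclidean nearest neighbour. Wavelet level, kernel, and weights are assumptions.
import numpy as np
import pywt
from sklearn.decomposition import KernelPCA

def palm_feature(img, wavelet="db4", level=3):
    img = (img - img.mean()) / (img.std() + 1e-8)          # mean/std intensity normalization
    approx = pywt.wavedec2(img, wavelet, level=level)[0]   # lowest-resolution subband
    return approx.ravel()

rng = np.random.default_rng(0)
gallery = np.stack([palm_feature(rng.random((128, 128))) for _ in range(20)])

kpca = KernelPCA(n_components=10, kernel="rbf", gamma=1e-2).fit(gallery)
gallery_feats = kpca.transform(gallery)
weights = 1.0 / (gallery_feats.var(axis=0) + 1e-8)         # weights for the Euclidean distance

probe = kpca.transform(gallery[:1])                         # query with a known palm
dists = np.sqrt(((gallery_feats - probe) ** 2 * weights).sum(axis=1))
print(int(dists.argmin()))                                  # 0: nearest neighbour is the right identity
```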