The use of pretrained backbones with finetuning has shown success for 2D vision and natural language processing tasks, with advantages over task-specific networks. In this paper, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We designed a 3D Swin Transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on the synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets but also outperforms state-of-the-art methods on downstream tasks, with +2.3 mIoU and +2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation, respectively, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validated the scalability, generality, and superior performance enabled by our approach.
Objective: Appropriate medical imaging is important for value-based care. We aim to evaluate the performance of generative pretrained transformer 4 (GPT-4), an innovative natural language processing model, in automatically recommending appropriate medical imaging in different clinical scenarios. Methods: Institutional Review Board (IRB) approval was not required due to the use of nonidentifiable data. Instead, we used as prompts 112 questions from the American College of Radiology (ACR) Radiology-TEACHES Program, an open-source question-and-answer program that guides appropriate medical imaging. We included 69 free-text case vignettes and 43 simplified cases. For the performance evaluation of GPT-4 and GPT-3.5, we considered the recommendations of the ACR guidelines as the gold standard, and three radiologists then analyzed the consistency of the responses from the GPT models with those of the ACR. We set a five-score criterion for the evaluation of the consistency. A paired t-test was applied to assess the statistical significance of the findings. Results: For the performance of the GPT models in free-text case vignettes, the accuracy of GPT-4 was 92.9%, whereas the accuracy of GPT-3.5 was just 78.3%. GPT-4 can provide more appropriate suggestions to reduce the overutilization of medical imaging than GPT-3.5 (t=3.429, P=0.001). For the performance of the GPT models in simplified scenarios, the accuracy of GPT-4 and GPT-3.5 was 66.5% and 60.0%, respectively. The differences were not statistically significant (t=1.858, P=0.070). GPT-4 was characterized by longer reaction times (27.1 s on average) and more extensive responses (137.1 words on average) than GPT-3.5. Conclusion: As an advanced tool for improving value-based healthcare in clinics, GPT-4 may guide appropriate medical imaging accurately and efficiently.
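The paired t-test mentioned in the Methods can be sketched as follows. This is a minimal illustration only: the score lists are hypothetical placeholders, not the study's data, and the function name `paired_t_statistic` is an assumption introduced here. It computes the t statistic and degrees of freedom; obtaining a P value would additionally require a t-distribution table or a statistics library.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test on two equal-length samples."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    # Work on the per-item differences, as a paired test requires.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical five-point consistency scores for the same five questions.
gpt4_scores = [5, 4, 5, 3, 4]
gpt35_scores = [4, 4, 3, 3, 2]

t_stat, dof = paired_t_statistic(gpt4_scores, gpt35_scores)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

Pairing matters here because both models answer the same questions, so the test looks at per-question differences rather than at two independent groups.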
Trained on a large corpus, pretrained models (PTMs) can capture different levels of concepts in context and hence generate universal language representations, which greatly benefit downstream natural language processing (NLP) tasks. In recent years, PTMs have been widely used in most NLP applications, especially for high-resource languages such as English and Chinese. However, scarce resources have discouraged the progress of PTMs for low-resource languages. Transformer-based PTMs for the Khmer language are presented in this work for the first time. We evaluate our models on two downstream tasks: part-of-speech tagging and news categorization. The dataset for the latter task is self-constructed. Experiments demonstrate the effectiveness of the Khmer models. In addition, we find that the current Khmer word segmentation technology does not aid performance improvement. We aim to release our models and datasets to the community in hopes of facilitating the future development of Khmer NLP applications.
Named Entity Recognition (NER) stands as a fundamental task within the field of biomedical text mining, aiming to extract specific types of entities such as genes, proteins, and diseases from complex biomedical texts and categorize them into predefined entity types. This process can provide basic support for the automatic construction of knowledge bases. In contrast to general texts, biomedical texts frequently contain numerous nested entities and local dependencies among these entities, presenting significant challenges to prevailing NER models. To address these issues, we propose a novel Chinese nested biomedical NER model based on RoBERTa and Global Pointer (RoBGP). Our model initially utilizes the RoBERTa-wwm-ext-large pretrained language model to dynamically generate word-level initial vectors. It then incorporates a Bidirectional Long Short-Term Memory network for capturing bidirectional semantic information, effectively addressing the issue of long-distance dependencies. Furthermore, the Global Pointer model is employed to comprehensively recognize all nested entities in the text. We conduct extensive experiments on the Chinese medical dataset CMeEE, and the results demonstrate the superior performance of RoBGP over several baseline models. This research confirms the effectiveness of RoBGP in Chinese biomedical NER, providing reliable technical support for biomedical information extraction and knowledge base construction.
Sentence classification is the process of categorizing a sentence based on the context of the sentence. Sentence categorization requires more semantic features than other tasks, such as dependency parsing, which requires more syntactic elements. Most existing strategies focus on the general semantics of a conversation without involving the context of the sentence, recognizing the progress and comparing impacts. An ensemble pre-trained language model was taken up here to classify the conversation sentences from the conversation corpus. The conversational sentences are classified into four categories: information, question, directive, and commission. These classification label sequences are for analyzing the conversation progress and predicting the pecking order of the conversation. An ensemble of Bidirectional Encoder Representations from Transformers (BERT), Robustly Optimized BERT Pretraining Approach (RoBERTa), Generative Pre-Trained Transformer (GPT), DistilBERT, and Generalized Autoregressive Pretraining for Language Understanding (XLNet) models is trained on the conversation corpus with hyperparameters. A hyperparameter tuning approach is carried out for better performance on sentence classification. This Ensemble of Pre-trained Language Models with Hyperparameter Tuning (EPLM-HT) system is trained on an annotated conversation dataset. The proposed approach outperformed the base BERT, GPT, DistilBERT, and XLNet transformer models. The proposed ensemble model with the fine-tuned parameters achieved an F1 score of 0.88.
In the field of natural language processing (NLP), there have been various pre-trained language models in recent years, with question answering systems gaining significant attention. However, as algorithms, data, and computing power advance, the issue of increasingly larger models and a growing number of parameters has surfaced. Consequently, model training has become more costly and less efficient. To enhance the efficiency and accuracy of the training process while reducing the model volume, this paper proposes a first-order pruning model, PAL-BERT, based on the ALBERT model according to the characteristics of question-answering (QA) systems and language models. Firstly, a first-order network pruning method based on the ALBERT model is designed, and the PAL-BERT model is formed. Then, the parameter optimization strategy of the PAL-BERT model is formulated, and the Mish function is used as the activation function instead of ReLU to improve the performance. Finally, after comparison experiments with the traditional deep learning models TextCNN and BiLSTM, it is confirmed that PAL-BERT is a pruning model compression method that can significantly reduce training time and optimize training efficiency. Compared with traditional models, PAL-BERT significantly improves the NLP task's performance.
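The activation swap described in this abstract can be illustrated with the standard definition of Mish, mish(x) = x · tanh(softplus(x)), next to ReLU. This is a minimal scalar sketch for intuition only; it is not taken from the PAL-BERT implementation, and the helper names are assumptions introduced here.

```python
import math


def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x)). Smooth and non-monotonic near zero."""
    return x * math.tanh(softplus(x))


def relu(x):
    """ReLU activation: max(0, x)."""
    return max(x, 0.0)


# Compare the two activations on a few sample inputs.
for value in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={value:+.1f}  relu={relu(value):.4f}  mish={mish(value):.4f}")
```

Unlike ReLU, Mish lets small negative inputs produce small negative outputs instead of clamping them to zero, and it approaches the identity for large positive inputs.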
As Natural Language Processing (NLP) continues to advance, driven by the emergence of sophisticated large language models such as ChatGPT, there has been a notable growth in research activity. This rapid uptake reflects increasing interest in the field and prompts critical inquiries into ChatGPT's applicability in the NLP domain. This review paper systematically investigates the role of ChatGPT in diverse NLP tasks, including information extraction, Named Entity Recognition (NER), event extraction, relation extraction, Part of Speech (PoS) tagging, text classification, sentiment analysis, emotion recognition, and text annotation. The novelty of this work lies in its comprehensive analysis of the existing literature, addressing a critical gap in understanding ChatGPT's adaptability, limitations, and optimal application. In this paper, we employed a systematic stepwise approach following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework to direct our search process and seek relevant studies. Our review reveals ChatGPT's significant potential in enhancing various NLP tasks. Its adaptability in information extraction tasks, sentiment analysis, and text classification showcases its ability to comprehend diverse contexts and extract meaningful details. Additionally, ChatGPT's flexibility in annotation tasks reduces manual effort and accelerates the annotation process, making it a valuable asset in NLP development and research. Furthermore, GPT-4 and prompt engineering emerge as a complementary mechanism, empowering users to guide the model and enhance overall accuracy. Despite its promising potential, challenges persist. The performance of ChatGPT needs to be tested using more extensive datasets and diverse data structures. In addition, its limitations in handling domain-specific language and the need for fine-tuning in specific applications highlight the importance of further investigations to address these issues.
Thanks to the strong representation capability of pre-trained language models, supervised machine translation models have achieved outstanding performance. However, the performance of these models drops sharply when the scale of the parallel training corpus is limited. Considering that the pre-trained language model has a strong ability for monolingual representation, the key challenge for machine translation is to construct the in-depth relationship between the source and target language by injecting lexical and syntactic information into pre-trained language models. To alleviate the dependence on the parallel corpus, we propose a Linguistics Knowledge-Driven MultiTask (LKMT) approach to inject part-of-speech and syntactic knowledge into pre-trained models, thus enhancing the machine translation performance. On the one hand, we integrate part-of-speech and dependency labels into the embedding layer and exploit a large-scale monolingual corpus to update all parameters of the pre-trained language models, thus ensuring that the updated language model contains potential lexical and syntactic information. On the other hand, we leverage an extra self-attention layer to explicitly inject linguistic knowledge into the pre-trained language model-enhanced machine translation model. Experiments on the benchmark dataset show that our proposed LKMT approach improves the Urdu-English translation accuracy by 1.97 points and the English-Urdu translation accuracy by 2.42 points, highlighting the effectiveness of our LKMT framework. Detailed ablation experiments confirm the positive impact of part-of-speech and dependency parsing on machine translation.
Purpose: Automatic keyphrase extraction (AKE) is an important task for grasping the main points of a text. In this paper, we aim to combine the benefits of the sequence labeling formulation and pretrained language models to propose an automatic keyphrase extraction model for Chinese scientific research. Design/methodology/approach: We regard AKE from Chinese text as a character-level sequence labeling task to avoid segmentation errors of Chinese tokenizers and initialize our model with the pretrained language model BERT, which was released by Google in 2018. We collect data from the Chinese Science Citation Database and construct a large-scale dataset from the medical domain, which contains 100,000 abstracts as the training set, 6,000 abstracts as the development set, and 3,094 abstracts as the test set. We use unsupervised keyphrase extraction methods including term frequency (TF), TF-IDF, and TextRank, and supervised machine learning methods including Conditional Random Field (CRF), Bidirectional Long Short-Term Memory Network (BiLSTM), and BiLSTM-CRF as baselines. Experiments are designed to compare word-level and character-level sequence labeling approaches on supervised machine learning models and BERT-based models. Findings: Compared with character-level BiLSTM-CRF, the best baseline model with an F1 score of 50.16%, our character-level sequence labeling model based on BERT obtains an F1 score of 59.80%, an absolute improvement of 9.64 percentage points. Research limitations: We consider only the automatic keyphrase extraction task rather than the keyphrase generation task, so only keyphrases that occur in the given text can be extracted. In addition, our proposed dataset is not suitable for dealing with nested keyphrases. Practical implications: We make our character-level IOB format dataset of Chinese Automatic Keyphrase Extraction from scientific Chinese medical abstracts (CAKE) publicly available for the benefit of the research community, at: https://github.com/possible1402/Dataset-For-Chinese-Medical-Keyphrase-Extraction. Originality/value: By designing comparative experiments, our study demonstrates that the character-level formulation is more suitable for the Chinese automatic keyphrase extraction task under the general trend of pretrained language models. Our proposed dataset also provides a unified method for model evaluation and can promote the development of Chinese automatic keyphrase extraction to some extent.
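The character-level IOB formulation described in this abstract can be sketched as below. This is an illustrative assumption of how such labels might be produced, not the authors' preprocessing code: the function name `char_level_iob`, the sample sentence, and the longest-match-first rule are all introduced here. Overlapping matches are skipped, which mirrors the stated limitation that nested keyphrases are not handled.

```python
def char_level_iob(text, keyphrases):
    """Tag every character of `text` with B, I, or O given a list of keyphrases."""
    tags = ["O"] * len(text)
    # Longer phrases first, so a shorter phrase cannot split a longer one.
    for phrase in sorted(keyphrases, key=len, reverse=True):
        if not phrase:
            continue
        start = text.find(phrase)
        while start != -1:
            end = start + len(phrase)
            # Only label spans that are still untagged (no nesting or overlap).
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = "B"
                for i in range(start + 1, end):
                    tags[i] = "I"
            start = text.find(phrase, start + 1)
    return tags


# Hypothetical example sentence and keyphrases.
sentence = "糖尿病患者的血糖控制"
labels = char_level_iob(sentence, ["糖尿病", "血糖"])
for char, tag in zip(sentence, labels):
    print(char, tag)
```

Because each character receives its own tag, no word segmenter is needed before labeling, which is the motivation the abstract gives for the character-level formulation.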
Hand Gesture Recognition (HGR) is a promising research area with an extensive range of applications, such as surgery, video game techniques, and sign language translation, where sign language is a complicated structured form of hand gestures. The fundamental building blocks of structured expressions in sign language are the arrangement of the fingers, the orientation of the hand, and the hand's position relative to the body. The importance of HGR has increased due to the increasing number of touchless applications and the rapid growth of the hearing-impaired population. Therefore, real-time HGR is one of the most effective interaction methods between computers and humans. Developing a user-free interface with good recognition performance should be the goal of real-time HGR systems. Nowadays, Convolutional Neural Networks (CNNs) show great recognition rates for different image-level classification tasks. It is challenging to train deep CNN networks like VGG-16, VGG-19, Inception-v3, and EfficientNet-B0 from scratch because only a few large labeled image datasets are available for static hand gesture images. Therefore, an efficient and robust hand gesture recognition system for sign language employing fine-tuned Inception-v3 and EfficientNet-B0 networks is proposed to identify hand gestures using a comparatively small HGR dataset. Experiments show that Inception-v3 achieved 90% accuracy and 0.93% precision, 0.91% recall, and 0.90% f1-score, respectively, while EfficientNet-B0 achieved 99% accuracy and 0.98%, 0.97%, 0.98% precision, recall, and f1-score, respectively.
Identifying personalities accurately helps merchants and management departments understand user needs in detail and improve the quality of service and decision-making efficiency. Existing research on text-based personality prediction mainly uses deep neural networks or pretrained language models to mine deep semantics, ignoring the dynamic interactions among personality features. This paper presents a novel personality prediction method that simultaneously taps into the capability of graph neural networks to model the deep interactions among features and that of pretrained language models to learn latent semantics with a hierarchical aggregation mechanism. Specifically, the proposed model leverages self-attention to capture the interaction relationships among POS tags, entities, personality tags, etc., and considers the labels' co-occurrence patterns. The efficacy of the proposed model is evaluated on the myPersonality and PANDORA datasets. This research contributes to the personality prediction literature from a multigranular personality feature learning perspective and provides business value for consuming predictive analytics.
Panoramic images, offering a 360-degree view, are essential in virtual reality (VR) and augmented reality (AR), enhancing realism with high-quality textures. However, acquiring complete and high-quality panoramic textures is challenging. This paper introduces a method using generative adversarial networks (GANs) and the contrastive language-image pretraining (CLIP) model to restore and control texture in panoramic images. The GAN model captures complex structures and maintains consistency, while CLIP enables fine-grained texture control via semantic text-image associations. GAN inversion optimizes latent codes for precise texture details. The resulting low dynamic range (LDR) images are converted to high dynamic range (HDR) using the Blender engine for seamless texture blending. Experimental results demonstrate the effectiveness and flexibility of this method in panoramic texture restoration and generation.
The effectiveness of AI-driven drug discovery can be enhanced by pretraining on small molecules. However, the conventional masked language model pretraining techniques are not suitable for molecule pretraining due to the limited vocabulary size and the non-sequential structure of molecules. To overcome these challenges, we propose FragAdd, a strategy that involves adding a chemically implausible molecular fragment to the input molecule. This approach allows for the incorporation of rich local information and the generation of a high-quality graph representation, which is advantageous for tasks like virtual screening. Consequently, we have developed a virtual screening protocol that focuses on identifying binders of estrogen receptor alpha, a nuclear receptor. Our results demonstrate a significant improvement in the binding capacity of the retrieved molecules. Additionally, we demonstrate that the FragAdd strategy can be combined with other self-supervised methods to further expedite the drug discovery process.
Decomposing complex real-world tasks into simpler subtasks and devising a subtask execution plan is critical for humans to achieve effective decision-making. However, replicating this process remains challenging for AI agents and naturally raises two questions: (1) How to extract discriminative knowledge representation from priors? (2) How to develop a rational plan to decompose complex problems? To address these issues, we introduce a framework that incorporates two main contributions. First, our multiple-encoder and individual-predictor regime goes beyond traditional architectures to extract nuanced task-specific dynamics from datasets, enriching the feature space for subtasks. Second, we innovate in planning by introducing a top-K subtask planning tree generated through an attention mechanism, which allows for dynamic adaptability and forward-looking decision-making. Our framework is empirically validated on the challenging BabyAI benchmark, including multiple combinatorially rich synthetic tasks (e.g., GoToSeq, SynthSeq, BossLevel), where it not only outperforms competitive baselines but also demonstrates superior adaptability and effectiveness in complex task decomposition.
Video-text retrieval is a challenging task for multimodal information processing due to the semantic gap between different modalities. However, most existing methods do not fully mine the intra-modal interactions, such as the temporal correlation of video frames, which results in poor matching performance. Additionally, the imbalanced semantic information between videos and texts also leads to difficulty in the alignment of the two modalities. To this end, we propose a dual inter-modal interaction network for video-text retrieval, i.e., DI-vTR. To learn the intra-modal interaction of video frames, we design a contextual-related video encoder to obtain more fine-grained content-oriented video representations. We also propose a dual inter-modal interaction module to accomplish accurate multilingual alignment between the video and text modalities by introducing multilingual text to improve the representation ability of text semantic features. Extensive experimental results on commonly used video-text retrieval datasets, including MSR-VTT, MSVD, and VATEX, show that the proposed method achieves significantly improved performance compared with state-of-the-art methods.
Objective: This study aimed to construct an intelligent prescription-generating (IPG) model based on deep-learning natural language processing (NLP) technology for multiple prescriptions in Chinese medicine. Materials and Methods: We selected the Treatise on Febrile Diseases and the Synopsis of Golden Chamber as basic datasets with EDA data augmentation, and the Yellow Emperor's Canon of Internal Medicine, the Classic of the Miraculous Pivot, and the Classic on Medical Problems as supplementary datasets for fine-tuning. We selected the word-embedding model based on the Imperial Collection of Four, the bidirectional encoder representations from transformers (BERT) model based on the Chinese Wikipedia, and the robustly optimized BERT approach (RoBERTa) model based on the Chinese Wikipedia and a general database. In addition, the BERT model was fine-tuned using the supplementary datasets to generate a Traditional Chinese Medicine-BERT model. Multiple IPG models were constructed based on the pretraining strategy and experiments were performed. Metrics of precision, recall, and F1-score were used to assess the model performance. Based on the trained models, we extracted and visualized the semantic features of some typical texts from the Treatise on Febrile Diseases and investigated the patterns. Results: Among all the trained models, the RoBERTa-large model performed the best, with a test set precision of 92.22%, recall of 86.71%, and F1-score of 89.38%, and 10-fold cross-validation precision of 94.5%±2.5%, recall of 90.47%±4.1%, and F1-score of 92.38%±2.8%. The semantic feature extraction results based on this model showed that the model was intelligently stratified based on different meanings such that the within-layer patterns showed the associations of symptom–symptoms, disease–symptoms, and symptom–punctuations, while the between-layer patterns showed a progressive or dynamic symptom and disease transformation. Conclusions: Deep-learning-based NLP technology significantly improves the performance of the IPG model. In addition, NLP-based semantic feature extraction may be vital to further investigate the ancient Chinese medicine texts.
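The relationship between the reported test-set precision, recall, and F1-score can be checked with the standard harmonic-mean formula. This is a minimal sketch using only the figures quoted in the abstract; the function name `f1_score` is introduced here for illustration.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Test-set figures reported for the RoBERTa-large model, in percent.
test_f1 = f1_score(92.22, 86.71)
print(round(test_f1, 2))  # → 89.38
```

The same check is not expected to reproduce the cross-validation figure exactly, because a mean of per-fold F1-scores generally differs from the F1 of the mean precision and mean recall.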
In this paper we present the results of the Interactive Argument-Pair Extraction in Judgement Document Challenge held by both the Chinese AI and Law Challenge (CAIL) and the Chinese National Social Media Processing Conference (SMP), and introduce the related data set, SMP-CAIL2020-Argmine. The task challenged participants to choose the correct argument among five candidates proposed by the defense to refute or acknowledge the given argument made by the plaintiff, providing the full context recorded in the judgement documents of both parties. We received entries from 63 competing teams, 38 of which scored higher than the provided baseline model (BERT) in the first phase and entered the second phase. The best performing system in the two phases achieved accuracy of 0.856 and 0.905, respectively. In this paper, we present the results of the competition and a summary of the systems, highlighting commonalities and innovations among participating systems. The SMP-CAIL2020-Argmine data set and baseline models have already been released.
Tuberculosis, caused by Mycobacterium tuberculosis, has been a major challenge for medical and healthcare sectors in many underdeveloped countries with limited diagnostic tools. Tuberculosis can be detected from microscopic slides and chest X-rays, but because of the high number of tuberculosis cases, this method can be tedious for both microbiologists and radiologists and can lead to misdiagnosis. These challenges can be addressed by employing Computer-Aided Detection (CAD) via AI-driven models, which learn features based on convolution and produce outputs with high accuracy. In this paper, we describe automated discrimination of X-ray and microscope slide images into tuberculosis and non-tuberculosis cases using pretrained AlexNet models. The study employed a chest X-ray dataset made available on the Kaggle repository and microscopic slide images from both Near East University Hospital and the Kaggle repository. For classification of tuberculosis using microscopic slide images, the model achieved 90.56% accuracy, 97.78% sensitivity, and 83.33% specificity for 70:30 splits. For classification of tuberculosis using X-ray images, the model achieved 93.89% accuracy, 96.67% sensitivity, and 91.11% specificity for 70:30 splits. Our result is in line with the notion that CNN models can be used for classifying medical images with high accuracy and precision.
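The three reported metrics follow from a binary confusion matrix, as the sketch below shows. The counts are hypothetical: the abstract does not give the confusion matrix, and the values here (90 tuberculosis and 90 non-tuberculosis test images) were chosen only because they happen to be consistent with the percentages reported for the microscopic slide images. The function name `binary_metrics` is likewise an assumption.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, and accuracy as percentages."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "sensitivity": round(100 * sensitivity, 2),
        "specificity": round(100 * specificity, 2),
        "accuracy": round(100 * accuracy, 2),
    }


# Hypothetical counts: 88 of 90 TB slides and 75 of 90 non-TB slides classified correctly.
slide_metrics = binary_metrics(tp=88, fn=2, tn=75, fp=15)
print(slide_metrics)
```

With balanced classes, accuracy is simply the average of sensitivity and specificity, which is why a high sensitivity can coexist with a noticeably lower specificity at a given accuracy.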
Instructional videos are very useful for completing complex daily tasks, and they naturally contain abundant clip-narration pairs. Existing works on procedure understanding are keen on pretraining various video-language models with these pairs and then finetuning downstream classifiers and localizers in a predetermined category space. These video-language models are proficient at representing short-term actions, basic objects, and their combinations, but they are still far from understanding long-term procedures. In addition, the predetermined procedure category faces the problem of combination disaster and is inherently inapt for unseen procedures. Therefore, we propose a novel compositional prompt learning (CPL) framework to understand long-term procedures by prompting short-term video-language models and reformulating several classical procedure understanding tasks into general video-text matching problems. Specifically, the proposed CPL consists of one visual prompt and three compositional textual prompts (including the action prompt, object prompt, and procedure prompt), which could compositionally distill knowledge from short-term video-language models to facilitate long-term procedure understanding. Besides, the task reformulation enables our CPL to perform well in all zero-shot, few-shot, and fully-supervised settings. Extensive experiments on two widely-used datasets for procedure understanding demonstrate the effectiveness of the proposed approach.
Purpose – According to the Indian Sign Language Research and Training Centre (ISLRTC), India has approximately 300 certified human interpreters to help people with hearing loss. This paper aims to address the issue of Indian Sign Language (ISL) sentence recognition and translation into semantically equivalent English text in a signer-independent mode. Design/methodology/approach – This study presents an approach that translates ISL sentences into English text using the MobileNetV2 model and Neural Machine Translation (NMT). The authors have created an ISL corpus from the Brown corpus using ISL grammar rules to perform machine translation. The authors' approach converts ISL videos of the newly created dataset into ISL gloss sequences using the MobileNetV2 model, and the recognized ISL gloss sequence is then fed to a machine translation module that generates an English sentence for each ISL sentence. Findings – As per the experimental results, the pretrained MobileNetV2 model proved the best-suited model for the recognition of ISL sentences, and NMT provided better results than Statistical Machine Translation (SMT) in converting ISL text into English text. The automatic and human evaluation of the proposed approach yielded accuracies of 83.3 and 86.1%, respectively. Research limitations/implications – The neural machine translation systems produced translations with repetitions of other translated words, strange translations when the total number of words per sentence increased, and, on occasion, one or more unexpected terms that had no relation to the source text. The most common type of error is the mistranslation of places, numbers, and dates. Although this has little effect on the overall structure of the translated sentence, it indicates that the embedding learned for these few words could be improved. Originality/value – Sign language recognition and translation is a crucial step toward improving communication between the deaf and the rest of society. Because of the shortage of human interpreters, an alternative approach is desired to help people achieve smooth communication with the Deaf. To motivate research in this field, the authors generated an ISL corpus of 13,720 sentences and a video dataset of 47,880 ISL videos. As there is no public dataset available for ISL videos incorporating signs released by ISLRTC, the authors created a new video dataset and ISL corpus.
Abstract: The use of pretrained backbones with finetuning has shown success for 2D vision and natural language processing tasks, with advantages over task-specific networks. In this paper, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We designed a 3D Swin Transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on a synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets but also outperforms state-of-the-art methods on downstream tasks, with +2.3 mIoU and +2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation, respectively, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validated the scalability, generality, and superior performance enabled by our approach.
Funding: National Natural Science Foundation of China (Grant Nos. 62171297 and 61931013).
Abstract: Objective: Appropriate medical imaging is important for value-based care. We aim to evaluate the performance of generative pretrained transformer 4 (GPT-4), an innovative natural language processing model, in providing appropriate medical imaging recommendations automatically in different clinical scenarios. Methods: Institutional Review Board (IRB) approval was not required due to the use of nonidentifiable data. Instead, we used 112 questions from the American College of Radiology (ACR) Radiology-TEACHES Program as prompts, which is an open-sourced question and answer program to guide appropriate medical imaging. We included 69 free-text case vignettes and 43 simplified cases. For the performance evaluation of GPT-4 and GPT-3.5, we considered the recommendations of ACR guidelines as the gold standard, and then three radiologists analyzed the consistency of the responses from the GPT models with those of the ACR. We set a five-score criterion for the evaluation of the consistency. A paired t-test was applied to assess the statistical significance of the findings. Results: For the performance of the GPT models in free-text case vignettes, the accuracy of GPT-4 was 92.9%, whereas the accuracy of GPT-3.5 was just 78.3%. GPT-4 can provide more appropriate suggestions to reduce the overutilization of medical imaging than GPT-3.5 (t=3.429, P=0.001). For the performance of the GPT models in simplified scenarios, the accuracy of GPT-4 and GPT-3.5 was 66.5% and 60.0%, respectively. The differences were not statistically significant (t=1.858, P=0.070). GPT-4 was characterized by longer reaction times (27.1 s on average) and more extensive responses (137.1 words on average) than GPT-3.5. Conclusion: As an advanced tool for improving value-based healthcare in clinics, GPT-4 may guide appropriate medical imaging accurately and efficiently.
Funding: Supported by the Major Projects of Guangdong Education Department for Foundation Research and Applied Research (No. 2017KZDXM031) and the Guangzhou Science and Technology Plan Project (No. 202009010021).
Abstract: Trained on a large corpus, pretrained models (PTMs) can capture different levels of concepts in context and hence generate universal language representations, which greatly benefit downstream natural language processing (NLP) tasks. In recent years, PTMs have been widely used in most NLP applications, especially for high-resource languages, such as English and Chinese. However, scarce resources have discouraged the progress of PTMs for low-resource languages. Transformer-based PTMs for the Khmer language are presented in this work for the first time. We evaluate our models on two downstream tasks: part-of-speech tagging and news categorization. The dataset for the latter task is self-constructed. Experiments demonstrate the effectiveness of the Khmer models. In addition, we find that the current Khmer word segmentation technology does not aid performance improvement. We aim to release our models and datasets to the community in hopes of facilitating the future development of Khmer NLP applications.
Funding: Supported by the Outstanding Youth Team Project of Central Universities (QNTD202308) and the Ant Group through the CCF-Ant Research Fund (CCF-AFSG 769498 RF20220214).
Abstract: Named Entity Recognition (NER) stands as a fundamental task within the field of biomedical text mining, aiming to extract specific types of entities such as genes, proteins, and diseases from complex biomedical texts and categorize them into predefined entity types. This process can provide basic support for the automatic construction of knowledge bases. In contrast to general texts, biomedical texts frequently contain numerous nested entities and local dependencies among these entities, presenting significant challenges to prevailing NER models. To address these issues, we propose a novel Chinese nested biomedical NER model based on RoBERTa and Global Pointer (RoBGP). Our model initially utilizes the RoBERTa-wwm-ext-large pretrained language model to dynamically generate word-level initial vectors. It then incorporates a Bidirectional Long Short-Term Memory network for capturing bidirectional semantic information, effectively addressing the issue of long-distance dependencies. Furthermore, the Global Pointer model is employed to comprehensively recognize all nested entities in the text. We conduct extensive experiments on the Chinese medical dataset CMeEE and the results demonstrate the superior performance of RoBGP over several baseline models. This research confirms the effectiveness of RoBGP in Chinese biomedical NER, providing reliable technical support for biomedical information extraction and knowledge base construction.
Abstract: Sentence classification is the process of categorizing a sentence based on the context of the sentence. Sentence categorization requires more semantic highlights than other tasks, such as dependence parsing, which requires more syntactic elements. Most existing strategies focus on the general semantics of a conversation without involving the context of the sentence, recognizing the progress and comparing impacts. An ensemble pre-trained language model was taken up here to classify the conversation sentences from the conversation corpus. The conversational sentences are classified into four categories: information, question, directive, and commission. These classification label sequences are for analyzing the conversation progress and predicting the pecking order of the conversation. An ensemble of Bidirectional Encoder Representations from Transformers (BERT), Robustly Optimized BERT pretraining Approach (RoBERTa), Generative Pre-Trained Transformer (GPT), DistilBERT and Generalized Autoregressive Pretraining for Language Understanding (XLNet) models is trained on the conversation corpus with hyperparameters. A hyperparameter tuning approach is carried out for better performance on sentence classification. This Ensemble of Pre-trained Language Models with Hyperparameter Tuning (EPLM-HT) system is trained on an annotated conversation dataset. The proposed approach outperformed the base BERT, GPT, DistilBERT and XLNet transformer models. The proposed ensemble model with the fine-tuned parameters achieved an F1 score of 0.88.
Funding: Supported by the Sichuan Science and Technology Program (2021YFQ0003, 2023YFSY0026, 2023YFH0004).
Abstract: In the field of natural language processing (NLP), there have been various pre-training language models in recent years, with question answering systems gaining significant attention. However, as algorithms, data, and computing power advance, the issue of increasingly larger models and a growing number of parameters has surfaced. Consequently, model training has become more costly and less efficient. To enhance the efficiency and accuracy of the training process while reducing the model volume, this paper proposes a first-order pruning model, PAL-BERT, based on the ALBERT model according to the characteristics of question-answering (QA) systems and language models. Firstly, a first-order network pruning method based on the ALBERT model is designed, and the PAL-BERT model is formed. Then, the parameter optimization strategy of the PAL-BERT model is formulated, and the Mish function is used as the activation function instead of ReLU to improve the performance. Finally, after comparison experiments with the traditional deep learning models TextCNN and BiLSTM, it is confirmed that PAL-BERT is a pruning model compression method that can significantly reduce training time and optimize training efficiency. Compared with traditional models, PAL-BERT significantly improves the NLP task's performance.
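The swap from ReLU to Mish mentioned in this abstract can be illustrated with the standard definition mish(x) = x * tanh(softplus(x)). The following is a minimal sketch in plain Python; the function names are illustrative, and how PAL-BERT actually integrates the activation into ALBERT is not described in the abstract.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Standard Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def relu(x: float) -> float:
    """ReLU, shown only for comparison."""
    return max(x, 0.0)


if __name__ == "__main__":
    # Unlike ReLU, Mish is smooth and lets small negative values through.
    for value in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"x={value:+.1f}  relu={relu(value):.4f}  mish={mish(value):.4f}")
```

Unlike ReLU, which zeroes all negative inputs, Mish returns a small negative output for moderately negative inputs and is differentiable everywhere.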
Abstract: As Natural Language Processing (NLP) continues to advance, driven by the emergence of sophisticated large language models such as ChatGPT, there has been a notable growth in research activity. This rapid uptake reflects increasing interest in the field and induces critical inquiries into ChatGPT's applicability in the NLP domain. This review paper systematically investigates the role of ChatGPT in diverse NLP tasks, including information extraction, Named Entity Recognition (NER), event extraction, relation extraction, Part of Speech (PoS) tagging, text classification, sentiment analysis, emotion recognition and text annotation. The novelty of this work lies in its comprehensive analysis of the existing literature, addressing a critical gap in understanding ChatGPT's adaptability, limitations, and optimal application. In this paper, we employed a systematic stepwise approach following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework to direct our search process and seek relevant studies. Our review reveals ChatGPT's significant potential in enhancing various NLP tasks. Its adaptability in information extraction tasks, sentiment analysis, and text classification showcases its ability to comprehend diverse contexts and extract meaningful details. Additionally, ChatGPT's flexibility in annotation tasks reduces manual effort and accelerates the annotation process, making it a valuable asset in NLP development and research. Furthermore, GPT-4 and prompt engineering emerge as a complementary mechanism, empowering users to guide the model and enhance overall accuracy. Despite its promising potential, challenges persist. The performance of ChatGPT needs to be tested using more extensive datasets and diverse data structures. Subsequently, its limitations in handling domain-specific language and the need for fine-tuning in specific applications highlight the importance of further investigations to address these issues.
Funding: Supported by the National Natural Science Foundation of China under Grants 61732005 and 61972186, and the Yunnan Provincial Major Science and Technology Special Plan Projects (Nos. 202103AA080015, 202203AA080004).
Abstract: Thanks to the strong representation capability of pre-trained language models, supervised machine translation models have achieved outstanding performance. However, the performance of these models drops sharply when the scale of the parallel training corpus is limited. Considering that the pre-trained language model has a strong ability for monolingual representation, the key challenge for machine translation is to construct the in-depth relationship between the source and target language by injecting lexical and syntactic information into pre-trained language models. To alleviate the dependence on the parallel corpus, we propose a Linguistics Knowledge-Driven MultiTask (LKMT) approach to inject part-of-speech and syntactic knowledge into pre-trained models, thus enhancing the machine translation performance. On the one hand, we integrate part-of-speech and dependency labels into the embedding layer and exploit a large-scale monolingual corpus to update all parameters of pre-trained language models, thus ensuring the updated language model contains potential lexical and syntactic information. On the other hand, we leverage an extra self-attention layer to explicitly inject linguistic knowledge into the pre-trained language model-enhanced machine translation model. Experiments on the benchmark dataset show that our proposed LKMT approach improves the Urdu-English translation accuracy by 1.97 points and the English-Urdu translation accuracy by 2.42 points, highlighting the effectiveness of our LKMT framework. Detailed ablation experiments confirm the positive impact of part-of-speech and dependency parsing on machine translation.
Funding: This work is supported by the project “Research on Methods and Technologies of Scientific Researcher Entity Linking and Subject Indexing” (Grant No. G190091) from the National Science Library, Chinese Academy of Sciences, and the project “Design and Research on a Next Generation of Open Knowledge Services System and Key Technologies” (2019XM55).
Abstract: Purpose: Automatic keyphrase extraction (AKE) is an important task for grasping the main points of a text. In this paper, we aim to combine the benefits of the sequence labeling formulation and a pretrained language model to propose an automatic keyphrase extraction model for Chinese scientific research. Design/methodology/approach: We regard AKE from Chinese text as a character-level sequence labeling task to avoid segmentation errors of the Chinese tokenizer, and initialize our model with the pretrained language model BERT, which was released by Google in 2018. We collect data from the Chinese Science Citation Database and construct a large-scale dataset from the medical domain, which contains 100,000 abstracts as the training set, 6,000 abstracts as the development set and 3,094 abstracts as the test set. We use unsupervised keyphrase extraction methods including term frequency (TF), TF-IDF and TextRank, and supervised machine learning methods including Conditional Random Field (CRF), Bidirectional Long Short Term Memory Network (BiLSTM), and BiLSTM-CRF as baselines. Experiments are designed to compare word-level and character-level sequence labeling approaches on supervised machine learning models and BERT-based models. Findings: Compared with character-level BiLSTM-CRF, the best baseline model with an F1 score of 50.16%, our character-level sequence labeling model based on BERT obtains an F1 score of 59.80%, an absolute improvement of 9.64%. Research limitations: We consider only the automatic keyphrase extraction task rather than the keyphrase generation task, so only keyphrases that occur in the given text can be extracted. In addition, our proposed dataset is not suitable for dealing with nested keyphrases. Practical implications: We make our character-level IOB format dataset of Chinese Automatic Keyphrase Extraction from scientific Chinese medical abstracts (CAKE) publicly available for the benefit of the research community, at https://github.com/possible1402/Dataset-For-Chinese-Medical-Keyphrase-Extraction. Originality/value: By designing comparative experiments, our study demonstrates that the character-level formulation is more suitable for the Chinese automatic keyphrase extraction task under the general trend of pretrained language models. Our proposed dataset also provides a unified method for model evaluation and can promote the development of Chinese automatic keyphrase extraction to some extent.
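The character-level IOB formulation described in this abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' preprocessing code: the function name `char_iob_tags`, the longest-first matching rule, and the example sentence are all assumptions introduced here, and the actual CAKE dataset format may differ in detail.

```python
from typing import List


def char_iob_tags(text: str, keyphrases: List[str]) -> List[str]:
    """Label every character of `text` as B (begin), I (inside) or O (outside).

    Hypothetical helper: longer keyphrases are matched first and overlapping
    matches are skipped, so nested keyphrases are not represented.
    """
    tags = ["O"] * len(text)
    for phrase in sorted(keyphrases, key=len, reverse=True):
        if not phrase:
            continue
        start = text.find(phrase)
        while start != -1:
            end = start + len(phrase)
            # Only tag a span that no earlier (longer) keyphrase has claimed.
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = "B"
                for i in range(start + 1, end):
                    tags[i] = "I"
            start = text.find(phrase, start + 1)
    return tags


if __name__ == "__main__":
    # Made-up example sentence, not taken from the dataset.
    sentence = "糖尿病患者的血糖控制"
    for char, tag in zip(sentence, char_iob_tags(sentence, ["糖尿病", "血糖控制"])):
        print(char, tag)
```

Because tags are assigned per character, no word segmenter is involved, which is the point the abstract makes about avoiding segmentation errors. The flat B/I/O scheme also shows why nested keyphrases cannot be encoded.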
Funding: This research work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (NRF-2022R1A2C1004657).
Abstract: Hand Gesture Recognition (HGR) is a promising research area with an extensive range of applications, such as surgery, video game techniques, and sign language translation, where sign language is a complicated structured form of hand gestures. The fundamental building blocks of structured expressions in sign language are the arrangement of the fingers, the orientation of the hand, and the hand's position relative to the body. The importance of HGR has increased due to the increasing number of touchless applications and the rapid growth of the hearing-impaired population. Therefore, real-time HGR is one of the most effective interaction methods between computers and humans. Developing a user-free interface with good recognition performance should be the goal of real-time HGR systems. Nowadays, Convolutional Neural Networks (CNNs) show great recognition rates for different image-level classification tasks. It is challenging to train deep CNN networks like VGG-16, VGG-19, Inception-v3, and EfficientNet-B0 from scratch because only a few significant labeled image datasets are available for static hand gesture images. In this work, an efficient and robust hand gesture recognition system for sign language employing finetuned Inception-v3 and EfficientNet-B0 networks is proposed to identify hand gestures using a comparatively small HGR dataset. Experiments show that Inception-v3 achieved 90% accuracy with precision, recall, and F1-score of 0.93, 0.91, and 0.90, respectively, while EfficientNet-B0 achieved 99% accuracy with precision, recall, and F1-score of 0.98, 0.97, and 0.98, respectively.
Funding: Supported by the National Natural Science Foundation of China (Nos. 72293575, 62071467 and 62141608).
Abstract: Identifying personalities accurately helps merchants and management departments understand user needs in detail and improve the quality of service and decision-making efficiency. Existing research on text-based personality prediction mainly uses deep neural networks or pretrained language models to mine deep semantics, ignoring the dynamic interactions among personality features. This paper presents a novel personality prediction method that simultaneously taps into the capability of graph neural networks to model the deep interactions among features and that of pretrained language models to learn latent semantics with a hierarchical aggregation mechanism. Specifically, the proposed model leverages self-attention to capture the interaction relationships among POS tags, entities, personality tags, etc., and considers the labels' co-occurrence patterns. The efficacy of the proposed model is evaluated on the myPersonality and PANDORA datasets. This research contributes to the personality prediction literature from a multigranular personality feature learning perspective and provides business value for consuming predictive analytics.
Abstract: Panoramic images, offering a 360-degree view, are essential in virtual reality (VR) and augmented reality (AR), enhancing realism with high-quality textures. However, acquiring complete and high-quality panoramic textures is challenging. This paper introduces a method using generative adversarial networks (GANs) and the contrastive language-image pretraining (CLIP) model to restore and control texture in panoramic images. The GAN model captures complex structures and maintains consistency, while CLIP enables fine-grained texture control via semantic text-image associations. GAN inversion optimizes latent codes for precise texture details. The resulting low dynamic range (LDR) images are converted to high dynamic range (HDR) using the Blender engine for seamless texture blending. Experimental results demonstrate the effectiveness and flexibility of this method in panoramic texture restoration and generation.
Funding: Supported by the National Key R&D Program of China (Nos. 2019YFA0905700 and 2021YFC2101500) and the National Natural Science Foundation of China (No. 62072283).
Abstract: The effectiveness of AI-driven drug discovery can be enhanced by pretraining on small molecules. However, the conventional masked language model pretraining techniques are not suitable for molecule pretraining due to the limited vocabulary size and the non-sequential structure of molecules. To overcome these challenges, we propose FragAdd, a strategy that involves adding a chemically implausible molecular fragment to the input molecule. This approach allows for the incorporation of rich local information and the generation of a high-quality graph representation, which is advantageous for tasks like virtual screening. Consequently, we have developed a virtual screening protocol that focuses on identifying binders of estrogen receptor alpha, a nuclear receptor. Our results demonstrate a significant improvement in the binding capacity of the retrieved molecules. Additionally, we demonstrate that the FragAdd strategy can be combined with other self-supervised methods to further expedite the drug discovery process.
Abstract: Decomposing complex real-world tasks into simpler subtasks and devising a subtask execution plan is critical for humans to achieve effective decision-making. However, replicating this process remains challenging for AI agents and naturally raises two questions: (1) How to extract discriminative knowledge representation from priors? (2) How to develop a rational plan to decompose complex problems? To address these issues, we introduce a framework that incorporates two main contributions. First, our multiple-encoder and individual-predictor regime goes beyond traditional architectures to extract nuanced task-specific dynamics from datasets, enriching the feature space for subtasks. Second, we innovate in planning by introducing a top-K subtask planning tree generated through an attention mechanism, which allows for dynamic adaptability and forward-looking decision-making. Our framework is empirically validated on the challenging BabyAI benchmark, including multiple combinatorially rich synthetic tasks (e.g., GoToSeq, SynthSeq, BossLevel), where it not only outperforms competitive baselines but also demonstrates superior adaptability and effectiveness in complex task decomposition.
Funding: Supported by the Key Research and Development Program of Shaanxi (2023-YBGY-218), the National Natural Science Foundation of China under Grants 62372357 and 62201424, the Fundamental Research Funds for the Central Universities (QTZX23072), and the ISN State Key Laboratory.
Abstract: Video-text retrieval is a challenging task for multimodal information processing due to the semantic gap between different modalities. However, most existing methods do not fully mine the intra-modal interactions, such as the temporal correlation of video frames, which results in poor matching performance. Additionally, the imbalanced semantic information between videos and texts also leads to difficulty in the alignment of the two modalities. To this end, we propose a dual inter-modal interaction network for video-text retrieval, i.e., DI-vTR. To learn the intra-modal interaction of video frames, we design a contextual-related video encoder to obtain more fine-grained content-oriented video representations. We also propose a dual inter-modal interaction module to accomplish accurate multilingual alignment between the video and text modalities by introducing multilingual text to improve the representation ability of text semantic features. Extensive experimental results on commonly-used video-text retrieval datasets, including MSR-VTT, MSVD and VATEX, show that the proposed method achieves significantly improved performance compared with state-of-the-art methods.
Abstract: Objective: This study aimed to construct an intelligent prescription-generating (IPG) model based on deep-learning natural language processing (NLP) technology for multiple prescriptions in Chinese medicine. Materials and Methods: We selected the Treatise on Febrile Diseases and the Synopsis of Golden Chamber as basic datasets with EDA data augmentation, and the Yellow Emperor's Canon of Internal Medicine, the Classic of the Miraculous Pivot, and the Classic on Medical Problems as supplementary datasets for fine-tuning. We selected the word-embedding model based on the Imperial Collection of Four, the bidirectional encoder representations from transformers (BERT) model based on the Chinese Wikipedia, and the robustly optimized BERT approach (RoBERTa) model based on the Chinese Wikipedia and a general database. In addition, the BERT model was fine-tuned using the supplementary datasets to generate a Traditional Chinese Medicine-BERT model. Multiple IPG models were constructed based on the pretraining strategy and experiments were performed. Metrics of precision, recall, and F1-score were used to assess the model performance. Based on the trained models, we extracted and visualized the semantic features of some typical texts from the Treatise on Febrile Diseases and investigated the patterns. Results: Among all the trained models, the RoBERTa-large model performed the best, with a test set precision of 92.22%, recall of 86.71%, and F1-score of 89.38%, and 10-fold cross-validation precision of 94.5%±2.5%, recall of 90.47%±4.1%, and F1-score of 92.38%±2.8%. The semantic feature extraction results based on this model showed that the model was intelligently stratified based on different meanings, such that the within-layer patterns showed the associations of symptom–symptoms, disease–symptoms, and symptom–punctuations, while the between-layer patterns showed a progressive or dynamic symptom and disease transformation. Conclusions: Deep-learning-based NLP technology significantly improves the performance of the IPG model. In addition, NLP-based semantic feature extraction may be vital to further investigate the ancient Chinese medicine texts.
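As a quick consistency check on the test-set figures reported in this abstract, the F1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the two reported percentages; the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Test-set precision and recall as reported in the abstract, in percent.
    print(f"{f1_score(92.22, 86.71):.2f}")  # prints 89.38
```

The recomputed value matches the reported test-set F1-score of 89.38%. The cross-validation F1 of 92.38% is an average over folds, so it need not equal the harmonic mean of the averaged precision and recall.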
Funding: Supported by the National Key Research and Development Plan (No. 2018YFC0830600), in cooperation with the China Justice Big Data Institute, which provided judgement documents and employed professional annotators. The competition is also sponsored by Beijing Thunisoft Information Technology Co., Ltd., and supported by both CAIL and SMP organizers.
Abstract: In this paper we present the results of the Interactive Argument-Pair Extraction in Judgement Document Challenge held by both the Chinese AI and Law Challenge (CAIL) and the Chinese National Social Media Processing Conference (SMP), and introduce the related data set, SMP-CAIL2020-Argmine. The task challenged participants to choose the correct argument among five candidates proposed by the defense to refute or acknowledge the given argument made by the plaintiff, providing the full context recorded in the judgement documents of both parties. We received entries from 63 competing teams, 38 of which scored higher than the provided baseline model (BERT) in the first phase and entered the second phase. The best performing system in the two phases achieved accuracy of 0.856 and 0.905, respectively. In this paper, we present the results of the competition and a summary of the systems, highlighting commonalities and innovations among participating systems. The SMP-CAIL2020-Argmine data set and baseline models have already been released.
Abstract: Tuberculosis, caused by Mycobacterium tuberculosis, has been a major challenge for the medical and healthcare sectors in many underdeveloped countries with limited diagnosis tools. Tuberculosis can be detected from microscopic slides and chest X-rays, but as a result of the high number of tuberculosis cases, this method can be tedious for both microbiologists and radiologists and can lead to misdiagnosis. These challenges can be solved by employing Computer-Aided Detection (CAD) via AI-driven models which learn features based on convolution and result in an output with high accuracy. In this paper, we describe automated discrimination of X-ray and microscope slide images into tuberculosis and non-tuberculosis cases using pretrained AlexNet models. The study employed a chest X-ray dataset made available on the Kaggle repository and microscopic slide images from both Near East University Hospital and the Kaggle repository. For classification of tuberculosis using microscopic slide images, the model achieved 90.56% accuracy, 97.78% sensitivity and 83.33% specificity for 70:30 splits. For classification of tuberculosis using X-ray images, the model achieved 93.89% accuracy, 96.67% sensitivity and 91.11% specificity for 70:30 splits. Our result is in line with the notion that CNN models can be used for classifying medical images with higher accuracy and precision.
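The three metrics reported in this abstract are linked: accuracy is the average of sensitivity and specificity weighted by class size. The sketch below shows that relation. The class counts are an assumption, since the abstract does not give test-set sizes; equal counts of 90 positive and 90 negative images are used only because they reproduce the reported figures.

```python
def accuracy_from_rates(sensitivity: float, specificity: float,
                        n_pos: int, n_neg: int) -> float:
    """Accuracy as the class-size-weighted mean of sensitivity and specificity.

    All rates use the same units (here percent). n_pos and n_neg are the
    numbers of positive and negative test samples; the values passed below
    are assumed, not taken from the source.
    """
    return (sensitivity * n_pos + specificity * n_neg) / (n_pos + n_neg)


if __name__ == "__main__":
    # Reported sensitivity and specificity with an ASSUMED balanced test set.
    slides = accuracy_from_rates(97.78, 83.33, n_pos=90, n_neg=90)
    xrays = accuracy_from_rates(96.67, 91.11, n_pos=90, n_neg=90)
    print(f"microscopic slides: {slides:.3f}%  (reported accuracy: 90.56%)")
    print(f"chest X-rays:       {xrays:.3f}%  (reported accuracy: 93.89%)")
```

Under the balanced-class assumption the recomputed accuracies agree with the reported ones to within rounding. With an imbalanced test set the same sensitivity and specificity would give a different accuracy.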
Abstract: Instructional videos are very useful for completing complex daily tasks, and they naturally contain abundant clip-narration pairs. Existing works on procedure understanding focus on pretraining various video-language models with these pairs and then finetuning downstream classifiers and localizers in a predetermined category space. These video-language models are proficient at representing short-term actions, basic objects, and their combinations, but they are still far from understanding long-term procedures. In addition, a predetermined procedure category space suffers from combinatorial explosion and is inherently ill-suited to unseen procedures. Therefore, we propose a novel compositional prompt learning (CPL) framework to understand long-term procedures by prompting short-term video-language models and reformulating several classical procedure understanding tasks into general video-text matching problems. Specifically, the proposed CPL consists of one visual prompt and three compositional textual prompts (the action prompt, object prompt, and procedure prompt), which compositionally distill knowledge from short-term video-language models to facilitate long-term procedure understanding. Besides, the task reformulation enables our CPL to perform well in zero-shot, few-shot, and fully-supervised settings. Extensive experiments on two widely-used datasets for procedure understanding demonstrate the effectiveness of the proposed approach.
Abstract: Purpose: According to the Indian Sign Language Research and Training Centre (ISLRTC), India has approximately 300 certified human interpreters to help people with hearing loss. This paper aims to address the issue of Indian Sign Language (ISL) sentence recognition and translation into semantically equivalent English text in a signer-independent mode. Design/methodology/approach: This study presents an approach that translates ISL sentences into English text using the MobileNetV2 model and Neural Machine Translation (NMT). The authors have created an ISL corpus from the Brown corpus using ISL grammar rules to perform machine translation. The authors' approach converts ISL videos of the newly created dataset into ISL gloss sequences using the MobileNetV2 model, and the recognized ISL gloss sequence is then fed to a machine translation module that generates an English sentence for each ISL sentence. Findings: As per the experimental results, the pretrained MobileNetV2 model proved to be the best-suited model for the recognition of ISL sentences, and NMT provided better results than Statistical Machine Translation (SMT) in converting ISL text into English text. The automatic and human evaluation of the proposed approach yielded accuracies of 83.3% and 86.1%, respectively. Research limitations/implications: The neural machine translation systems produced translations with repetitions of other translated words, strange translations when the total number of words per sentence increased, and occasionally one or more unexpected terms that had no relation to the source text. The most common type of error is the mistranslation of places, numbers and dates. Although this has little effect on the overall structure of the translated sentence, it indicates that the embedding learned for these few words could be improved. Originality/value: Sign language recognition and translation is a crucial step toward improving communication between the deaf and the rest of society. Because of the shortage of human interpreters, an alternative approach is desired to help people achieve smooth communication with the Deaf. To motivate research in this field, the authors generated an ISL corpus of 13,720 sentences and a video dataset of 47,880 ISL videos. As there is no public dataset available for ISL videos incorporating signs released by ISLRTC, the authors created a new video dataset and ISL corpus.