Machine Learning (ML) algorithms play a pivotal role in Speech Emotion Recognition (SER), although they encounter a formidable obstacle in accurately discerning a speaker's emotional state. The examination of speakers' emotional states holds significant importance in a range of real-time applications, including but not limited to virtual reality, human-robot interaction, emergency centers, and human behavior assessment. Accurately identifying emotions in the SER process relies on extracting relevant information from audio inputs. Previous studies on SER have predominantly utilized short-time characteristics such as Mel Frequency Cepstral Coefficients (MFCCs) due to their ability to capture the periodic nature of audio signals effectively. Although such traits can improve the perception and interpretation of emotional depictions, MFCCs have limitations. This study therefore aims to tackle that issue by systematically selecting multiple audio cues, enhancing the classifier model's efficacy in accurately discerning human emotions. The utilized dataset is taken from the EMO-DB database. Preprocessing of the input speech is done with a 2D Convolutional Neural Network (CNN), which applies convolutional operations to spectrograms, as they afford a visual representation of how the frequency content of the audio signal changes over time. The next step is spectrogram data normalization, which is crucial for Neural Network (NN) training as it aids faster convergence. Five auditory features, namely MFCCs, Chroma, Mel-Spectrogram, Contrast, and Tonnetz, are then extracted from the spectrogram sequentially. The aim of feature selection is to retain only dominant features by excluding irrelevant ones; in this paper, the Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS) techniques were employed to select among the multiple audio cues. Finally, the feature sets composed from the hybrid feature extraction methods are fed into a deep Bidirectional Long Short-Term Memory (Bi-LSTM) network to discern emotions. Since a deep Bi-LSTM can hierarchically learn complex features and increases model capacity through more robust temporal modeling, it is more effective than a shallow Bi-LSTM in capturing the intricate tones of emotional content present in speech signals. The effectiveness and resilience of the proposed SER model were evaluated in experiments comparing it to state-of-the-art SER techniques. The results indicated that the model achieved accuracy rates of 90.92%, 93%, and 92% on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EMO-DB), and the Interactive Emotional Dyadic Motion Capture (IEMOCAP) datasets, respectively. These findings signify a prominent enhancement in the ability to identify emotional depictions in speech, showcasing the potential of the proposed model in advancing the SER field.
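As a concrete illustration of the feature extraction stage, the sketch below pulls the five named auditory feature sets from an utterance with librosa and averages each over time frames into a fixed-length vector. librosa, the 16 kHz sampling rate, and the frame-averaging are assumptions for illustration; the abstract does not name a toolkit or specify how features are aggregated.

```python
# Hedged sketch: extract the five auditory feature sets named in the abstract
# with librosa (an assumed toolkit; the paper does not specify one), then
# average each over time frames to form a fixed-length utterance vector.
import numpy as np
import librosa

def extract_features(path, sr=16000, n_mfcc=40):
    y, sr = librosa.load(path, sr=sr)
    harmonic = librosa.effects.harmonic(y)        # tonnetz expects harmonic signal
    feats = [
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc),   # MFCCs
        librosa.feature.chroma_stft(y=y, sr=sr),           # Chroma
        librosa.feature.melspectrogram(y=y, sr=sr),        # Mel-Spectrogram
        librosa.feature.spectral_contrast(y=y, sr=sr),     # Contrast
        librosa.feature.tonnetz(y=harmonic, sr=sr),        # Tonnetz
    ]
    # Mean over frames gives one vector per feature set; concatenate them.
    return np.concatenate([f.mean(axis=1) for f in feats])
```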
Pedestrian attribute recognition is a very important problem in video surveillance and video forensics. Traditional methods assume the pedestrian attributes are independent and design handcrafted features for each one. In this paper, we propose a joint hierarchical multi-task learning algorithm to learn the relationships among attributes for better recognition of pedestrian attributes in still images using convolutional neural networks (CNN). We divide the attributes into local and global ones according to spatial and semantic relations, and then learn the semantic attributes through a hierarchical multi-task CNN model in which each CNN in the first layer predicts one group of local attributes and the CNN in the second layer predicts the global attributes. Our multi-task learning framework allows each CNN model to simultaneously share visual knowledge among different groups of attribute categories. Extensive experiments are conducted on two popular and challenging benchmarks in surveillance scenarios, namely the PETA and RAP pedestrian attribute datasets. On both benchmarks, our framework achieves results superior to state-of-the-art methods, reaching 88.2% on PETA and 83.25% on RAP, respectively.
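The hierarchical two-layer multi-task idea can be sketched in PyTorch as below: first-layer heads predict groups of local attributes from a shared backbone, and a second-layer head predicts the global attributes from the shared features plus the local predictions. The backbone, group sizes, and dimensions are illustrative assumptions, not the authors' exact architecture.

```python
# Illustrative PyTorch sketch of the hierarchical multi-task idea: first-layer
# heads predict groups of local attributes; a second-layer head takes the
# shared features and the local predictions and outputs global attributes.
# Backbone, group sizes and dimensions are assumptions, not the paper's.
import torch
import torch.nn as nn

class HierarchicalMTL(nn.Module):
    def __init__(self, local_groups=(5, 8, 6), n_global=10):
        super().__init__()
        self.backbone = nn.Sequential(                 # shared visual features
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # First layer: one head per local attribute group.
        self.local_heads = nn.ModuleList(
            nn.Linear(64, g) for g in local_groups)
        # Second layer: global attributes from shared features + local logits.
        self.global_head = nn.Linear(64 + sum(local_groups), n_global)

    def forward(self, x):
        h = self.backbone(x)
        locals_ = [head(h) for head in self.local_heads]
        global_ = self.global_head(torch.cat([h, *locals_], dim=1))
        return locals_, global_
```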
The coal-rock interface recognition method based on the multi-sensor data fusion technique is put forward because of the limitations of single-sensor recognition methods. The measuring theory based on the multi-sensor data fusion technique is analyzed, and on this basis the test platform of the recognition system is built. The advantage of data fusion with the fuzzy neural network (FNN) technique is probed. A two-level FNN is constructed and data fusion is carried out. The experiments show that under various conditions the method consistently achieves a much higher recognition rate than conventional ones.
Gesture recognition is used in many practical applications such as human-robot interaction, medical rehabilitation, and sign language. With increasing motion sensor development, multiple data sources have become available, which has led to the rise of multi-modal gesture recognition. Since our previous approach to gesture recognition depends on a unimodal system, it is difficult to classify similar motion patterns. To solve this problem, a novel approach which integrates motion, audio, and video models is proposed, using a dataset captured by Kinect. The proposed system recognizes observed gestures using the three models; their recognition results are integrated by the proposed framework, and the output becomes the final result. The motion and audio models are learned using Hidden Markov Models, while a Random Forest, serving as the video classifier, is used to learn the video model. In the experiments testing the performance of the proposed system, the motion and audio models most suitable for gesture recognition are chosen by varying the feature vectors and learning methods. Additionally, the unimodal and multi-modal models are compared with respect to recognition accuracy. All experiments are conducted on the dataset provided by the organizer of MMGRC, a workshop for the Multi-Modal Gesture Recognition Challenge. The comparison results show that the multi-modal model composed of the three models scores the highest recognition rate, meaning that the complementary relationship among the three models improves the accuracy of gesture recognition. The proposed system provides application technology for understanding human actions of daily life more precisely.
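A minimal sketch of the integration step follows, assuming simple weighted score-level fusion of the three per-modality classifiers; the paper's actual integration framework and any weighting scheme are not described in the abstract.

```python
# Hedged sketch of integrating per-modality recognition results, assuming
# weighted score-level fusion; the weights shown here are illustrative
# assumptions, not values from the paper.
import numpy as np

def fuse_scores(motion_probs, audio_probs, video_probs,
                weights=(0.4, 0.3, 0.3)):
    """Each *_probs is an (n_classes,) array of class posteriors from one
    modality model (HMMs for motion/audio, Random Forest for video)."""
    stacked = np.stack([motion_probs, audio_probs, video_probs])
    fused = np.average(stacked, axis=0, weights=weights)
    return int(np.argmax(fused))          # final gesture label

# Usage: fuse_scores(p_motion, p_audio, p_video) -> predicted class index
```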
This paper presents two systems for recognizing static signs (digits) from American Sign Language (ASL). These systems avoid the use of color marks or gloves, using instead low-pass and high-pass filters in the space and frequency domains, and color space transformations. The first system used rotational signatures based on a correlation operator; minimum distance was used for the classification task. The second system computed the seven Hu invariants from binary images; these descriptors were fed to a Multi-Layer Perceptron (MLP) in order to recognize the 9 different classes. The first system achieves a 100% recognition rate with leave-one-out validation, and the second achieves 96.7% with Hu moments and 100% using 36 normalized moments under k-fold cross-validation.
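The second system's pipeline can be sketched as below: the seven Hu moment invariants of a binary image feed an MLP. OpenCV and scikit-learn are assumed toolkits, and the log-scaling and MLP size are illustrative choices.

```python
# Hedged sketch of the second system: seven Hu moment invariants from a
# binary image feed a Multi-Layer Perceptron. OpenCV/scikit-learn are
# assumed toolkits; the log transform and layer size are illustrative.
import cv2
import numpy as np
from sklearn.neural_network import MLPClassifier

def hu_features(binary_img):
    hu = cv2.HuMoments(cv2.moments(binary_img)).flatten()
    # Log-scale the moments, which span many orders of magnitude.
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)

def train(images, labels):
    """images: binarized sign images; labels: digit classes (9 classes)."""
    X = np.array([hu_features(img) for img in images])
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000)
    return clf.fit(X, labels)
```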
Athletes have various emotions before competition, and mood states have an impact on competition results. Recognition of athletes' mood states could help athletes make better adjustments before competition, which is significant for competition achievement. In this paper, physiological signals of female rowing athletes were collected pre- and post-competition. Based on multiple physiological signals related to the pre- and post-competition periods, such as heart rate and respiration rate, features were extracted after subtracting the emotion baseline. Particle swarm optimization (PSO) was then adopted to optimize feature selection from the feature set, combined with a least squares support vector machine (LS-SVM) classifier. Positive and negative mood states were classified by the LS-SVM with PSO feature optimization. The results showed that the classification accuracy of the LS-SVM algorithm combined with PSO and baseline subtraction was better than without baseline subtraction. The combination contributes to good classification of rowing athletes' mood states and would be informative for the psychological adjustment of athletes.
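A hedged sketch of the PSO-driven feature selection loop follows, with a standard SVC standing in for the LS-SVM (scikit-learn ships no LS-SVM); swarm size, iteration count, and PSO constants are illustrative assumptions.

```python
# Hedged sketch of PSO-based feature selection wrapped around a classifier.
# A scikit-learn SVC stands in for the LS-SVM; all PSO constants below are
# illustrative assumptions, not the paper's settings.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def pso_select(X, y, n_particles=20, n_iter=30, w=0.7, c1=1.5, c2=1.5):
    rng = np.random.default_rng(0)
    d = X.shape[1]
    pos = rng.random((n_particles, d))          # >0.5 means "keep feature"
    vel = np.zeros_like(pos)

    def fitness(p):
        mask = p > 0.5
        if not mask.any():
            return 0.0
        return cross_val_score(SVC(), X[:, mask], y, cv=3).mean()

    pbest = pos.copy()
    pbest_fit = np.array([fitness(p) for p in pos])
    gbest = pbest[pbest_fit.argmax()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, d))
        vel = w*vel + c1*r1*(pbest - pos) + c2*r2*(gbest - pos)
        pos = np.clip(pos + vel, 0, 1)
        fit = np.array([fitness(p) for p in pos])
        improved = fit > pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[pbest_fit.argmax()].copy()
    return gbest > 0.5                          # boolean feature mask
```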
This paper proposes a new method of multi-band signal reconstruction based on Orthogonal Matching Pursuit (OMP), which aims to develop a robust Ecological Sounds Recognition (ESR) system. Firstly, OMP is employed to sparsely decompose the original signal, so that the highly correlated components are retained for reconstruction in the first stage. Then, according to the frequency distribution of both foreground sound and background noise, the signal is compensated with the residual components in the second stage. Via this two-stage reconstruction, highly non-stationary noise is effectively reduced, and the reconstruction precision of the foreground sound is improved. At the recognition stage, we employ deep belief networks to model the composite feature sets extracted from the reconstructed signal. The experimental results show that the proposed approach achieves superior recognition performance on 60 classes of ecological sounds in different environments under different Signal-to-Noise Ratios (SNR), compared with existing methods.
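A minimal sketch of the first-stage OMP reconstruction, assuming scikit-learn's OMP solver and a DCT dictionary (the paper's dictionary is not specified); the residual returned here is what the second stage would draw compensation components from.

```python
# Hedged sketch of stage-one reconstruction: sparsely decompose the signal
# over a dictionary via OMP and keep the k strongest components. The DCT
# dictionary and k are illustrative assumptions; the frequency-guided
# second-stage compensation is not shown.
import numpy as np
from scipy.fft import idct
from sklearn.linear_model import OrthogonalMatchingPursuit

def omp_reconstruct(signal, k=32):
    n = len(signal)
    D = idct(np.eye(n), axis=0, norm='ortho')   # DCT basis, atoms as columns
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k)
    omp.fit(D, signal)
    recon = D @ omp.coef_ + omp.intercept_       # stage-one reconstruction
    residual = signal - recon                    # input to stage two
    return recon, residual
```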
Automatic and safe lane changing is key to achieving driverless vehicles. To accurately recognize the lane-changing state of a moving vehicle and ensure driving safety, a vehicle lane-change recognition model based on a multi-class Support Vector Machine (Multi-class SVM) is designed. Vehicle trajectory data from US Highway 101 are selected from the NGSIM dataset and classified, and the lane-changing process is divided into a car-following phase, a lane-change preparation phase, and a lane-change execution phase. Grid search combined with particle swarm optimization (Grid Search-PSO) is adopted to tune the penalty parameter C and the kernel parameter g of the SVM model. The multi-class SVM lane-change recognition model is trained and tested on the sample data, achieving a test accuracy of 97.68%. The study shows that the model can effectively recognize vehicle behavior states during the lane-changing process and provides support for research on the lane-changing phase.
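A minimal sketch of the C/g tuning step, using scikit-learn's GridSearchCV as a stand-in for the paper's grid search combined with PSO; the parameter grids and the three phase labels are assumptions for illustration.

```python
# Hedged sketch of tuning the SVM penalty C and kernel parameter g (gamma).
# GridSearchCV stands in for the paper's Grid Search-PSO hybrid, which the
# abstract does not detail; the grids below are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def tune_lane_change_svm(X, y):
    """X: trajectory features; y: 0=following, 1=preparing, 2=executing."""
    grid = GridSearchCV(
        SVC(kernel='rbf', decision_function_shape='ovr'),
        param_grid={'C': np.logspace(-2, 3, 6),
                    'gamma': np.logspace(-4, 1, 6)},
        cv=5)
    grid.fit(X, y)
    return grid.best_estimator_, grid.best_params_
```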
Understanding people's emotions through natural language is a challenging task for intelligent systems based on the Internet of Things (IoT). The major difficulty is caused by the lack of basic knowledge of emotion expressions with respect to a variety of real-world contexts. In this paper, we propose a Bayesian inference method to explore latent semantic dimensions as contextual information in natural language and to learn the knowledge of emotion expressions based on these semantic dimensions. Our method synchronously infers the latent semantic dimensions as topics in words and predicts the emotion labels in both word-level and document-level texts. The Bayesian inference results enable us to visualize the connection between words and emotions with respect to different semantic dimensions. By further incorporating a corpus-level hierarchy in the document emotion distribution assumption, we can balance the document emotion recognition results and achieve even better word and document emotion predictions. Our experiments on word-level and document-level emotion prediction, based on the well-developed Chinese emotion corpus Ren-CECps, render both higher accuracy and better robustness compared to state-of-the-art emotion prediction algorithms.
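As a much-simplified flavor of the Bayesian word-emotion idea, the sketch below estimates the posterior P(emotion | word) from labeled counts with a Dirichlet prior; the paper's actual model additionally infers latent semantic dimensions (topics), which this sketch omits, and the corpus format is an assumption.

```python
# Much-simplified Bayesian flavor of word-level emotion prediction:
# posterior P(emotion | word) from counts with a symmetric Dirichlet prior.
# The paper's model also infers latent topics, omitted here.
from collections import Counter, defaultdict

def word_emotion_posterior(corpus, emotions, alpha=1.0):
    """corpus: iterable of (word, emotion_label) pairs."""
    counts = defaultdict(Counter)
    for word, emo in corpus:
        counts[word][emo] += 1

    def posterior(word):
        c = counts[word]
        total = sum(c.values()) + alpha * len(emotions)
        return {e: (c[e] + alpha) / total for e in emotions}
    return posterior

# post = word_emotion_posterior(pairs, ['joy', 'anger', 'sorrow'])
# post('delight') -> e.g. {'joy': 0.7, 'anger': 0.15, 'sorrow': 0.15}
```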
An algorithm for face description and recognition based on multi-resolution features with multi-scale local binary patterns (multi-LBP) is proposed. A facial image pyramid is constructed, and each facial image is divided into various regions from which partial and holistic local binary pattern (LBP) histograms are extracted. All LBP features of each image are concatenated into a single LBP eigenvector over the different resolutions. The dimensionality of the LBP features is then reduced by a local margin alignment (LMA) algorithm based on manifolds, which can preserve the between-class variance. A support vector machine (SVM) is applied to classify the facial images. Extensive experiments on the ORL and CMU face databases clearly show the superiority of the proposed scheme over some existing algorithms, especially regarding the robustness of the method against different facial expressions and postures of the subjects.
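A hedged sketch of the multi-resolution, multi-scale LBP description follows: regional LBP histograms from each pyramid level and radius are concatenated into a single vector. scikit-image is an assumed toolkit; the pyramid depth, radii, and 4x4 grid are illustrative, and the LMA reduction and SVM stages are omitted.

```python
# Hedged sketch of the multi-LBP descriptor: regional LBP histograms over
# pyramid levels and radii, concatenated into one eigenvector. Parameters
# are illustrative assumptions, not the paper's configuration.
import numpy as np
from skimage.feature import local_binary_pattern
from skimage.transform import pyramid_gaussian

def multi_lbp_vector(gray_face, levels=3, radii=(1, 2, 3), grid=4):
    feats = []
    for img in list(pyramid_gaussian(gray_face, max_layer=levels - 1))[:levels]:
        img_u8 = (img * 255).astype(np.uint8)    # LBP expects integer images
        for r in radii:                          # multiple LBP scales
            lbp = local_binary_pattern(img_u8, 8 * r, r, method='uniform')
            h, w = lbp.shape
            for i in range(grid):                # regional histograms
                for j in range(grid):
                    block = lbp[i*h//grid:(i+1)*h//grid,
                                j*w//grid:(j+1)*w//grid]
                    hist, _ = np.histogram(block, bins=8*r + 2,
                                           range=(0, 8*r + 2), density=True)
                    feats.append(hist)
    return np.concatenate(feats)
```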
In this paper, a new multiclass classification algorithm is proposed based on the idea of Locally Linear Embedding (LLE), to avoid the defect of traditional manifold learning algorithms, which cannot deal with new sample points. The algorithm defines a reconstruction error as the classification criterion, computed from a sample's reconstruction weights obtained via LLE. Furthermore, the existence and characteristics of a low-dimensional manifold in range-profile time-frequency information are explored using manifold learning, aimed at the problem of target recognition with high range resolution MilliMeter-Wave (MMW) radar. The new algorithm is applied to radar target recognition. The experimental results show the algorithm is effective; compared with other classification algorithms, our method improves recognition precision, and the result is not sensitive to the input parameters.
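The reconstruction-error criterion can be sketched as below: a query is reconstructed from its k nearest neighbors within each class using LLE-style sum-to-one weights, and the class with the smallest residual wins. The neighborhood size and regularizer are assumptions carried over from standard LLE, not the paper's settings.

```python
# Hedged sketch of classification by LLE-style reconstruction error: for
# each class, reconstruct the query from its k nearest in-class neighbors
# (weights summing to 1) and pick the class with the smallest residual.
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    Z = neighbors - x                    # shift neighbors to the query
    G = Z @ Z.T                          # local Gram matrix
    G += reg * np.trace(G) * np.eye(len(G))
    w = np.linalg.solve(G, np.ones(len(G)))
    return w / w.sum()                   # enforce sum-to-one constraint

def classify(x, X_train, y_train, k=8):
    errs = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        idx = np.argsort(np.linalg.norm(Xc - x, axis=1))[:k]
        w = lle_weights(x, Xc[idx])
        errs[c] = np.linalg.norm(x - w @ Xc[idx])   # reconstruction error
    return min(errs, key=errs.get)
```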
Segmenting Arabic handwriting has been a subject of research in the field of Arabic character recognition for more than 25 years. The majority of reported segmentation techniques share a critical shortcoming: over-segmentation. The aim of segmentation is to produce the letters (segments) of a handwritten word; when a resulting letter (segment) is made of more than one piece (stroke) instead of one, this is called over-segmentation. Our objective is to overcome this problem by using an Artificial Neural Network (ANN) to verify the resulting segments. We propose a set of heuristic-based rules to assemble strokes in order to report the precisely segmented letters. Preprocessing phases that include normalization and feature extraction are required as a prerequisite step for the ANN system for recognition and verification. In our previous work [1], we achieved a segmentation success rate of 86%, but without recognition. In this work, our experimental results confirmed a segmentation success rate of no less than 95%.
Deep Learning is a powerful technique that is widely applied to Image Recognition and Natural Language Processing tasks, amongst many others. In this work, we propose an efficient technique to utilize pre-trained Convolutional Neural Network (CNN) architectures to extract powerful features from images for object recognition purposes. We build on the existing concept of extending the learning from pre-trained CNNs to new databases through activations by proposing to consider multiple deep layers. We exploit the progressive learning that happens at the various intermediate layers of CNNs to construct Deep Multi-Layer (DM-L) feature extraction vectors that achieve excellent object recognition performance. Two popular pre-trained CNN architecture models, VGG_16 and VGG_19, have been used in this work to extract feature sets from three deep fully connected layers, namely "fc6", "fc7", and "fc8", for object recognition purposes. Using the Principal Component Analysis (PCA) technique, the dimensionality of the DM-L feature vectors has been reduced to form powerful feature vectors that are fed to an external Classifier Ensemble for classification, instead of the Softmax-based classification layers of the two original pre-trained CNN models. The proposed DM-L technique has been applied to the benchmark Caltech-101 object recognition database. Conventional wisdom may suggest that feature extraction based on the deepest layer, i.e., "fc8", will result in the best recognition performance compared to "fc6", but our results prove otherwise for the two considered models. Our experiments have revealed that for both models, the "fc6"-based feature vectors achieve the best recognition performance. State-of-the-art recognition performances of 91.17% and 91.35% have been achieved by utilizing the "fc6"-based feature vectors for the VGG_16 and VGG_19 models, respectively. These performances were achieved by considering 30 sample images per class, whereas the proposed system is capable of achieving improved performance by considering all sample images per class. Our research shows that for feature extraction based on CNNs, multiple layers should be considered, and then the layer that maximizes recognition performance can be selected.
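A minimal sketch of extracting "fc6" activations from a pre-trained VGG_16, assuming a recent torchvision, followed by a PCA reduction in the DM-L spirit; the external classifier ensemble is omitted and the PCA dimension is an illustrative choice.

```python
# Hedged sketch of "fc6" feature extraction from a pre-trained VGG_16 with
# torchvision (an assumed toolkit), plus a DM-L-style PCA reduction; the
# classifier ensemble stage and the PCA dimension are illustrative.
import torch
from torchvision import models
from sklearn.decomposition import PCA

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
fc6 = torch.nn.Sequential(*list(vgg.classifier.children())[:1])  # first Linear

@torch.no_grad()
def extract_fc6(batch):                 # batch: (N, 3, 224, 224), normalized
    h = vgg.features(batch)
    h = torch.flatten(vgg.avgpool(h), 1)
    return fc6(h)                       # (N, 4096) fc6 feature vectors

# feats = extract_fc6(images).numpy()
# reduced = PCA(n_components=256).fit_transform(feats)  # DM-L reduction
```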
The named entity recognition (NER) task aims to identify the entities contained in unstructured text and assign them to predefined entity categories. With the development of the Internet and social media, textual information often appears alongside visual modalities such as images, and traditional NER methods perform poorly on such multi-modal information. In recent years, multi-modal NER has received wide attention. However, existing multi-modal NER methods suffer from insufficient fine-grained alignment of cross-modal knowledge: textual representations fuse semantically irrelevant image information, thereby introducing noise. To address these problems, this paper proposes FGITA, a multi-modal NER framework based on fine-grained image-text alignment. First, the method determines the semantic relevance between fine-grained textual entities and image sub-objects through object detection and semantic similarity judgment. Second, a bilinear attention mechanism computes relevance weights between image sub-objects and entities, and sub-object information is fused into the entity representation according to these weights. Finally, a cross-modal contrastive learning method is proposed that optimizes the distance between entities and images in the embedding space according to their degree of matching, helping the entity representation learn relevant image information. Experiments on two public datasets show that FGITA outperforms five mainstream multi-modal NER methods, verifying the effectiveness of the approach as well as the importance and superiority of fine-grained cross-modal alignment in multi-modal NER.
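The bilinear-attention fusion step can be sketched as below: relevance weights between image sub-objects and a text entity determine how much visual information enters the entity representation. The dimensions and the residual fusion form are assumptions; FGITA's object detection and contrastive learning stages are omitted.

```python
# Hedged sketch of bilinear-attention fusion: relevance weights between
# image sub-objects and a text entity gate how much sub-object information
# is added to the entity representation. Dimensions are assumptions.
import torch
import torch.nn as nn

class BilinearFusion(nn.Module):
    def __init__(self, d_text=768, d_img=512):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_text, d_img) * 0.02)  # bilinear map
        self.proj = nn.Linear(d_img, d_text)

    def forward(self, entity, objects):
        """entity: (d_text,), objects: (n_obj, d_img) sub-object features."""
        scores = entity @ self.W @ objects.T        # bilinear relevance
        weights = torch.softmax(scores, dim=-1)     # per-sub-object weights
        visual = weights @ objects                  # weighted visual context
        return entity + self.proj(visual)           # fused entity representation
```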