Abstract
Human action recognition based on computer vision has broad application prospects in video surveillance, intelligent driving, human-computer interaction, multimedia content review, and other fields, and human-object interaction is one of its core components. Most existing human-object interaction recognition models, which are based on multi-stream convolutional neural networks, capture the relationship between humans and objects only from surface-level visual features; they do not fully exploit the key regions of the human body or the deep semantic relationships between humans and objects. To address this problem, this paper proposes a hierarchical graph neural network (HGNN) model for human-object interaction. HGNN explicitly models, from local to global, the key regions of the human body and the scene graph formed by humans and objects, and uses an attention graph pooling mechanism (AttPool) to remove redundant information and noise from the hierarchical graph. A graph convolutional network then captures the deep semantic relationships between graph nodes, aggregating and refining the initial features extracted by the convolutional neural network to obtain a feature representation that reflects the essence of the interaction. In addition, an interim supervised classification on the middle-level graph constrains the network to better learn the human-body patterns of interactive actions and keeps it from becoming "biased" toward the interacting objects. Finally, a multi-task loss function is designed to train the HGNN model effectively. To verify the effectiveness of HGNN, extensive experiments were conducted on the large public benchmark V-COCO. The results show that the proposed model is adaptable and robust across common human-object interactions: its accuracy (mAP) exceeds that of previous graph neural network based methods by a large margin and is also ahead of most of the latest multi-stream convolutional models.
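The pooling and graph-convolution steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no formulas, so the softmax-scored top-k pooling, the symmetric-normalized graph convolution, the function names `attention_pool` and `gcn_layer`, and all sizes below are assumptions chosen only to show the general idea of dropping low-attention nodes before aggregating features over the remaining graph.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])              # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))   # degrees are >= 1 after self-loops
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ feats @ weight, 0.0)

def attention_pool(adj, feats, score_vec, keep):
    """Keep the `keep` highest-scoring nodes and gate their features by the score."""
    logits = feats @ score_vec
    scores = np.exp(logits - logits.max())
    scores /= scores.sum()                          # softmax over nodes
    idx = np.sort(np.argsort(scores)[-keep:])       # top-k nodes, in original order
    pooled_feats = feats[idx] * scores[idx, None]
    pooled_adj = adj[np.ix_(idx, idx)]              # subgraph induced by kept nodes
    return pooled_adj, pooled_feats, idx

rng = np.random.default_rng(0)
num_nodes, in_dim, out_dim = 6, 8, 4                # hypothetical sizes: body-part, human, object nodes
adj = np.ones((num_nodes, num_nodes)) - np.eye(num_nodes)   # fully connected toy graph
feats = rng.normal(size=(num_nodes, in_dim))        # stand-in for CNN region features

pooled_adj, pooled_feats, kept = attention_pool(adj, feats, rng.normal(size=in_dim), keep=3)
out = gcn_layer(pooled_adj, pooled_feats, rng.normal(size=(in_dim, out_dim)))
print(out.shape)  # (3, 4)
```

In a trained model the scoring vector and the weight matrix would be learned parameters; here they are random so that the sketch stays self-contained.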
Authors
李宝珍
张晋
王宝录
余平
LI Bao-zhen; ZHANG Jin; WANG Bao-lu; YU Ping (Shendong Jinjie Colliery, Chn Energy, Shenmu, Shaanxi 719319, China; Chn Energy Network Information Technology (Beijing) Co., Ltd., Beijing 100011, China)
Source
《计算机科学》
CSCD
Peking University Core Journal
2022, Issue S02, pp. 643-650 (8 pages)
Computer Science
Keywords
Computer vision
Human action recognition
Human-object interaction
Deep learning
Graph neural network