Multimodal Active Speaker Detection and Virtual Cinematography for Video Conferencing

Paper title

Multimodal active speaker detection and virtual cinematography for video conferencing

Authors

Ross Cutler, Ramin Mehran, Sam Johnson, Cha Zhang, Adam Kirk, Oliver Whyte, Adarsh Kowdle

Abstract

Active speaker detection (ASD) and virtual cinematography (VC) can significantly improve the remote user experience of a video conference by automatically panning, tilting, and zooming a video conferencing camera: users subjectively rate an expert video cinematographer's video significantly higher than unedited video. We describe a new automated ASD and VC system that performs within 0.3 MOS (mean opinion score) of an expert cinematographer, based on subjective ratings on a 1-5 scale. The system uses a 4K wide-FOV camera, a depth camera, and a microphone array; it extracts features from each modality and trains an ASD using an AdaBoost machine learning system that is very efficient and runs in real time. A VC is similarly trained using machine learning to optimize the subjective quality of the overall experience. To avoid distracting the room participants and to reduce switching latency, the system has no moving parts: the VC works by cropping and zooming the 4K wide-FOV video stream. The system was tuned and evaluated using extensive crowdsourcing techniques, with evaluation on a dataset of N=100 meetings, each 2-5 minutes in length.
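The abstract states that the VC works by cropping and zooming the 4K wide-FOV stream rather than moving a physical camera. A minimal sketch of that idea follows, assuming the ASD has already produced a speaker bounding box in pixel coordinates. The `Box` type, the `crop_window` function, the `margin` padding factor, and the 16:9 output aspect ratio are all hypothetical choices for illustration; the paper's trained VC is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in pixel coordinates (hypothetical helper type)."""
    x: int
    y: int
    w: int
    h: int


def crop_window(speaker: Box, frame_w: int = 3840, frame_h: int = 2160,
                margin: float = 2.0, aspect: float = 16 / 9) -> Box:
    """Return a crop of the wide frame centered on the speaker.

    `margin` is an assumed padding factor around the speaker box. The crop
    keeps the requested aspect ratio and is clamped to the frame bounds.
    """
    # Pad the speaker box, then expand one side to reach the target aspect.
    w = speaker.w * margin
    h = speaker.h * margin
    if w / h < aspect:
        w = h * aspect
    else:
        h = w / aspect

    # Shrink uniformly if the padded crop would exceed the frame.
    scale = min(1.0, frame_w / w, frame_h / h)
    w, h = int(round(w * scale)), int(round(h * scale))

    # Center on the speaker, then clamp so the crop stays inside the frame.
    cx = speaker.x + speaker.w / 2
    cy = speaker.y + speaker.h / 2
    x = max(0, min(int(round(cx - w / 2)), frame_w - w))
    y = max(0, min(int(round(cy - h / 2)), frame_h - h))
    return Box(x, y, w, h)
```

The returned window would then be scaled to the output resolution, which is the digital equivalent of a pan, tilt, and zoom. A real system would also need to smooth or rate-limit changes to the window between frames, which this sketch does not attempt.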
