Paper Title

Actions speak louder than words: Semi-supervised learning for browser fingerprinting detection

Authors

Sarah Bird, Vikas Mishra, Steven Englehardt, Rob Willoughby, David Zeber, Walter Rudametkin, Martin Lopatka

Abstract

As online tracking continues to grow, existing anti-tracking and fingerprinting detection techniques that require significant manual input must be augmented. Heuristic approaches to fingerprinting detection are precise but must be carefully curated. Supervised machine learning techniques proposed for detecting tracking require manually generated label-sets. Seeking to overcome these challenges, we present a semi-supervised machine learning approach for detecting fingerprinting scripts. Our approach is based on the core insight that fingerprinting scripts have similar patterns of API access when generating their fingerprints, even though their access patterns may not match exactly. Using this insight, we group scripts by their JavaScript (JS) execution traces and apply a semi-supervised approach to detect new fingerprinting scripts. We detail our methodology and demonstrate its ability to identify the majority of scripts ($\geqslant$94.9%) identified by existing heuristic techniques. We also show that the approach expands beyond detecting known scripts by surfacing candidate scripts that are likely to include fingerprinting. Through an analysis of these candidate scripts we discovered fingerprinting scripts that were missed by heuristics and for which there are no heuristics. In particular, we identified over one hundred device-class fingerprinting scripts present on hundreds of domains. To the best of our knowledge, this is the first time device-class fingerprinting has been measured in the wild. These successes illustrate the power of a sparse vector representation and semi-supervised learning to complement and extend existing tracking detection techniques.
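The core idea in the abstract, representing each script by a sparse vector of the APIs it touches and then using a few known fingerprinting scripts to surface similar unlabeled ones, can be illustrated with a small sketch. This is not the authors' pipeline: the abstract does not specify the grouping algorithm, so the sketch swaps in a simple nearest-seed cosine-similarity rule. The script names, the example traces, the function names, and the 0.7 threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def to_sparse_vector(trace):
    """Count each API symbol in a trace; symbols never called are implicitly zero."""
    return Counter(trace)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[symbol] for symbol, count in a.items() if symbol in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def surface_candidates(traces, seed_labels, threshold=0.7):
    """Return unlabeled scripts whose API usage resembles any known seed script.

    `threshold` is an assumed cutoff, not a value taken from the paper.
    """
    vectors = {name: to_sparse_vector(trace) for name, trace in traces.items()}
    candidates = {}
    for name, vector in vectors.items():
        if name in seed_labels:
            continue  # already labeled by a heuristic
        best = max(
            (cosine_similarity(vector, vectors[seed]) for seed in seed_labels),
            default=0.0,
        )
        if best >= threshold:
            candidates[name] = best
    return candidates


# Hypothetical execution traces: one script flagged by a heuristic, two unlabeled.
traces = {
    "known_fp.js": [
        "HTMLCanvasElement.toDataURL",
        "CanvasRenderingContext2D.fillText",
        "Navigator.userAgent",
        "Screen.width",
    ],
    "unknown_a.js": [
        "HTMLCanvasElement.toDataURL",
        "CanvasRenderingContext2D.fillText",
        "Navigator.userAgent",
        "Navigator.language",
    ],
    "unknown_b.js": [
        "Document.createElement",
        "Element.appendChild",
        "Window.setTimeout",
    ],
}
seed_labels = {"known_fp.js"}

candidates = surface_candidates(traces, seed_labels)
print(candidates)
```

In this toy data, `unknown_a.js` shares three of its four API calls with the seed script and is surfaced as a candidate even though its access pattern does not match exactly, while `unknown_b.js` shares none and is ignored. Surfaced candidates would still need manual review before being treated as fingerprinting scripts.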
