Paper Title
Massively Parallel and Asynchronous Tsetlin Machine Architecture Supporting Almost Constant-Time Scaling
Paper Authors
Paper Abstract
Using logical clauses to represent patterns, Tsetlin Machines (TMs) have recently obtained competitive performance in terms of accuracy, memory footprint, energy, and learning speed on several benchmarks. Each TM clause votes for or against a particular class, with classification resolved using a majority vote. While the evaluation of clauses is fast, being based on binary operators, the voting makes it necessary to synchronize the clause evaluation, impeding parallelization. In this paper, we propose a novel scheme for desynchronizing the evaluation of clauses, eliminating the voting bottleneck. In brief, every clause runs in its own thread for massive native parallelism. For each training example, we keep track of the class votes obtained from the clauses in local voting tallies. The local voting tallies allow us to detach the processing of each clause from the rest of the clauses, supporting decentralized learning. This means that the TM most of the time will operate on outdated voting tallies. We evaluated the proposed parallelization across diverse learning tasks and it turns out that our decentralized TM learning algorithm copes well with working on outdated data, resulting in no significant loss in learning accuracy. Furthermore, we show that the proposed approach provides up to 50 times faster learning. Finally, learning time is almost constant for reasonable clause amounts (employing from 20 to 7,000 clauses on a Tesla V100 GPU). For sufficiently large clause numbers, computation time increases approximately proportionally. Our parallel and asynchronous architecture thus allows processing of massive datasets and operating with more clauses for higher accuracy.
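To make the desynchronization idea concrete, below is a minimal conceptual sketch in Python, not the authors' CUDA implementation: each clause runs in its own thread, evaluates its conjunction of literals, and adds its vote to a shared, possibly stale, local voting tally. Names such as make_clause, evaluate, and clause_worker are illustrative assumptions, and the Tsetlin automata feedback rules (Type I/II feedback) are omitted for brevity.

```python
# Conceptual sketch of desynchronized clause evaluation with a local voting
# tally. Assumed/illustrative names; not the paper's GPU implementation.
import threading
import random

NUM_CLAUSES = 8    # half of the clauses vote for the class, half against
NUM_FEATURES = 4   # Boolean input features (literals) per example

# Shared, possibly outdated, per-example voting tally. Each clause thread
# reads and adjusts it without waiting for the other clauses.
vote_tally = 0
tally_lock = threading.Lock()  # atomic add stands in for GPU atomics


def make_clause(polarity):
    """Return a randomly initialized clause: a set of included literal indices."""
    included = [i for i in range(NUM_FEATURES) if random.random() < 0.5]
    return {"polarity": polarity, "included": included}


def evaluate(clause, example):
    """Conjunction (binary AND) of the included literals, as in a TM clause."""
    return all(example[i] for i in clause["included"])


def clause_worker(clause, example):
    """Each clause runs independently: read the (possibly stale) tally,
    evaluate the clause, then add its own vote. Feedback to the clause's
    Tsetlin automata would be decided from the stale tally alone."""
    global vote_tally
    stale_tally = vote_tally  # may lag behind the other threads
    vote = clause["polarity"] if evaluate(clause, example) else 0
    with tally_lock:
        vote_tally += vote
    _ = stale_tally  # placeholder for the omitted feedback step


if __name__ == "__main__":
    example = [1, 0, 1, 1]
    clauses = [make_clause(+1 if i % 2 == 0 else -1) for i in range(NUM_CLAUSES)]
    threads = [threading.Thread(target=clause_worker, args=(c, example))
               for c in clauses]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("class vote sum:", vote_tally)  # sign gives the predicted class
```

In this sketch the only synchronization is the atomic-style update of the tally itself; no thread waits for all clauses to finish before proceeding, which mirrors the abstract's point that clauses operate on outdated voting tallies most of the time.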