Paper title
Leveraging Subword Embeddings for Multinational Address Parsing
Paper authors
Paper abstract
Address parsing consists of identifying the segments that make up an address, such as a street name or a postal code. Because of its importance for tasks like record linkage, address parsing has been approached with many techniques. Neural network methods have defined a new state of the art for address parsing. While this approach has yielded notable results, previous work has focused only on applying neural networks to parse addresses from a single source country. We propose an approach in which we employ subword embeddings and a Recurrent Neural Network architecture to build a single model capable of learning to parse addresses from multiple countries at the same time, while taking into account the differences in languages and address formatting systems. We achieve accuracies of around 99% on the countries used for training, with no pre-processing or post-processing needed. We explore the possibility of transferring the address parsing knowledge obtained by training on some countries' addresses to other countries with no further training, in a zero-shot transfer learning setting. We achieve good results for 80% of the countries (33 out of 41), and near state-of-the-art performance for almost 50% of them (20 out of 41). In addition, we propose an open-source Python implementation of our trained models.
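To illustrate why subword units can help a single model cope with several languages, here is a minimal sketch of one common subword scheme, character n-grams with word-boundary markers. The abstract does not say which subword embedding method the authors use, so the function name `char_ngrams`, the n-gram sizes, and the choice of scheme itself are assumptions made purely for illustration.

```python
def char_ngrams(token, n_min=3, n_max=4):
    """Return the character n-grams of a token (hypothetical helper).

    The token is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from n-grams in the middle of the word.
    """
    marked = f"<{token}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


# Two related street-type words share several subword units, so a model
# built on subword embeddings can relate them even if one is unseen.
shared = set(char_ngrams("strasse")) & set(char_ngrams("hauptstrasse"))

print(char_ngrams("rue"))  # ['<ru', 'rue', 'ue>', '<rue', 'rue>']
print(sorted(shared))
```

In a scheme like this, a word's vector is typically composed from the vectors of its subword units, which is what lets rare or out-of-vocabulary address tokens still receive a meaningful representation.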