与时间窗口有关定向问题问题的增强学习方法

论文标题

与时间窗口有关定向问题问题的增强学习方法

A Reinforcement Learning Approach to the Orienteering Problem with Time Windows

论文作者

Gama, Ricardo, Fernandes, Hugo L.

论文摘要

定向启动问题与时间窗口（OPTW）是一个组合优化问题，其目标是最大化从不同访问的位置收集的总分数。神经网络模型在组合优化中的应用最近在处理类似问题（例如旅行推销员问题）方面表现出了令人鼓舞的结果。神经网络允许使用加强学习或监督学习的学习解决方案，具体取决于可用数据。在学习阶段之后，可以将其概括并快速进行微调，以进一步提高性能和个性化。优势是显而易见的，因为对于实际应用程序，解决方案质量，个性化和执行时间都是应考虑的重要因素。这项研究探讨了使用强化学习训练的指针网络模型来解决OPTW问题。我们提出了一个修改后的体系结构，该体系结构利用指针网络更好地解决与动态相关约束有关的问题。在其各种应用中，可以使用OPTW来对旅游旅行设计问题（TTDP）进行建模。我们要考虑到可以在参观特定实例区域的游客中进行更改的变量来培训指针网络：启动位置，开始时间，可用时间以及给出了每个兴趣点的分数。一旦训练了模型区域，它就可以使用光束搜索来推断特定游客的解决方案。我们基于对几个现有基准OPTW实例的评估。我们表明，它概括了访问每个地区的不同游客，并且在现实时期计算解决方案时，它通常优于最常用的启发式方法。

The Orienteering Problem with Time Windows (OPTW) is a combinatorial optimization problem where the goal is to maximize the total score collected from different visited locations. The application of neural network models to combinatorial optimization has recently shown promising results in dealing with similar problems, like the Travelling Salesman Problem. A neural network allows learning solutions using reinforcement learning or supervised learning, depending on the available data. After the learning stage, it can be generalized and quickly fine-tuned to further improve performance and personalization. The advantages are evident since, for real-world applications, solution quality, personalization, and execution times are all important factors that should be taken into account. This study explores the use of Pointer Network models trained using reinforcement learning to solve the OPTW problem. We propose a modified architecture that leverages Pointer Networks to better address problems related with dynamic time-dependent constraints. Among its various applications, the OPTW can be used to model the Tourist Trip Design Problem (TTDP). We train the Pointer Network with the TTDP problem in mind, by sampling variables that can change across tourists visiting a particular instance-region: starting position, starting time, available time, and the scores given to each point of interest. Once a model-region is trained, it can infer a solution for a particular tourist using beam search. We based the assessment of our approach on several existing benchmark OPTW instances. We show that it generalizes across different tourists that visit each region and that it generally outperforms the most commonly used heuristic, while computing the solution in realistic times.

下载PDF全文

下载文献需遵守相关版权规定

论文标题