1 Introduction

With the rapid development of big data and artificial intelligence, the analysis of customer visits and consumption habits has been widely applied in e-commerce, medical advice, urban transportation and online Q&A [1], and sequential recommendation systems have become an important means of addressing information overload. When we navigate the Internet, we leave many traces, such as shopping records. This information reveals our behavioural patterns and preferences, which can be utilized by a recommendation system [2, 3]. Such a platform typically mines dynamic changes from a series of interactions between a user and his or her favourite items and then recommends items that he or she may be interested in purchasing in the near future.

In daily life, our behaviours are often sequential, so static, traditional recommendation methods cannot describe them adequately. An example of sequential behaviour is shown in Fig. 1. Unlike traditional algorithms, sequential recommendation methods can model user behaviour sequences, user-item interactions, and temporal changes in user preferences and item popularity. This dynamic process, which accounts for sequential dependencies, enables more accurate recommendations.

Fig. 1 Sequential behaviour instance of a user, Tom

In the big data era, large-scale recommendation systems examine millions of user interactions every day. While each user may have relatively stable interests and hobbies, users are also attracted by emerging trends and real-time hot topics. This requires considering both long-term and short-term interests [4]. Long-term interests represent users' preferences over a longer time span, reflecting their stable preferences and long-term behavioural patterns. Short-term interests, on the other hand, represent users' immediate preferences over a shorter time span; these are typically influenced by the current context and environment and are temporary and variable. Recommendation systems therefore need to consider both the long-term and short-term interests of users to provide more accurate and personalized recommendations.

In recent years, many researchers have applied attention mechanisms [5] to recommendation systems. Just as humans selectively focus on specific content within their cognitive space when processing incoming signals and continuously shift their focus in time and space, attention mechanisms in neural networks have proven effective in various machine learning tasks [6,7,8]. They enable models to dynamically allocate attention weights based on the importance of different parts of the input: at each time step or position, distinct weights are assigned according to relevance, allowing the model to focus on the information most relevant to the current task.

Our proposed method aims to address the existing issues in current research on recommendation systems: (1) Previous multihead attention models have used the same positional embeddings for each attention head. This approach leads to the learning of redundant patterns and can result in information bottlenecks due to simplistic embedding schemes. (2) In terms of combining long-term and short-term interests, there is a need to capture the dynamic nature of user interests and how they evolve over time. Additionally, it is crucial to understand how these combined interests can effectively operate in more complex scenarios with information overload.

Based on the above factors, we propose a sequential recommendation model for balancing long- and short-term benefits (MSMT-LSI), which uses multiple forms of temporal embedding to extract a variety of different patterns from users’ sequential behaviours and make full use of temporal information. In addition, a multilayer long short-term self-attention network is used for sequence recommendation, where users’ habits or preferences over a period of time are captured separately by two multihead self-attention networks and eventually integrated to form a hybrid representation.

The remainder of this article is organized as follows. In Sect. 2, the relevant theoretical foundation is described. Section 3 introduces preliminary knowledge of the MSMT-LSI model. In Sect. 4, the improved MSMT-LSI model is proposed based on a multiheaded self-attention mechanism and multiple temporal embeddings of long-short-term interests. The relevant experiments and discussions are presented in Sect. 5. Finally, the conclusion is given in Sect. 6.

2 Related Work

Traditional models, including sequential pattern mining [9] and Markov chain (MC) models [10], have natural advantages in bridging the correlations among interactions in a sequence. The classic method Factorizing Personalized Markov Chains for Next-Basket Recommendation (FPMC) [11] integrates matrix factorization (MF) and MCs to capture general preferences and short-term changes of the next basket, respectively. However, MC-based methods focus on the local sequential pattern between two adjacent actions (or the most recent ones), so they can capture only short-term dependencies and may ignore the global information of the entire sequence.

Hidasi et al. proposed the session-based model GRU4Rec [12], which uses a GRU-based recurrent neural network (RNN), in 2016, demonstrating the advantages of RNNs in sequence recommendation. However, this approach considers only the current interest of the user and ignores overall interests. Moreover, because its recurrent structure depends on the last hidden state and the current action, the RNN also suffers from low efficiency and high time consumption. In 2018, Tang and Wang proposed Caser [13], which uses a convolutional neural network (CNN) to learn consumers' short-term characteristics and makes predictions in combination with general preferences. Because CNNs learn patterns between local regions, as in an image, rather than relying on strict sequential order, they can compensate for the deficiencies of RNN-based recommendation to a certain extent. However, CNN-based models do not effectively capture long-term dependencies because of the limited receptive field of their filters.

In sequential recommendation, the attention mechanism is often used to emphasize the truly relevant and important interactions while ignoring those that are irrelevant to the next interaction [14]. The attention-based recommender model AttRec [15] applies weighted attention to user behaviours and item features, dynamically allocating attention weights to highlight important information. This enables the model to better capture user interests and item relevance, thereby improving recommendation accuracy and personalization. Inspired by the attention-based sequence-to-sequence model Transformer [16], Kang et al. proposed the self-attention-based sequence recommendation model SASRec [17], which significantly outperformed MC-, RNN- and CNN-based methods. Huang et al. proposed a unified contextual self-attention network, CSAN [18], which takes heterogeneous user behaviours into account and projects them into a common latent semantic space; however, it does not consider timestamps when modelling dependencies. Li et al. proposed an interval-aware self-attention-based sequential recommendation model, TiSASRec [19], in 2020, which models both the relative time interval and the absolute position; however, it does not combine long-term and short-term preferences and models sequences in only one direction. Sun et al. proposed deep bidirectional modelling that simultaneously attends to the contextual information of interaction sequences [20]. Rashed et al. captured the dynamic nature and contextual changes of user profiles with a context- and attribute-aware recommendation model (CARCA) [21]. Attention mechanisms can also be combined with CNNs and RNNs to address their shortcomings. Tan et al. proposed a Bi-GRU network that models users' long-term interests and short-term consumption motivation through attention and then combines them to predict the next behaviour [22]. Ying et al. proposed a two-layer hierarchical attention network, SHAN [23], in 2018 to capture and combine users' long-term and short-term preferences; however, it ignores the temporal decay of long-term behaviour and the user-item correlation in each feature dimension. In addition, many prior studies have aimed to address data sparsity [24,25,26]. Existing attention mechanisms therefore leave considerable room for improvement in modelling users' dynamic and diverse preferences.

The above prior studies demonstrated the ability of the interval-based self-attention mechanism to perform well in sequence recommendation. The self-attention mechanism uses neither recursion nor convolution and is more flexible than RNNs or CNNs in dealing with the length of sequences, which gives it an advantage in modelling dependencies. In addition, it is easy to parallelize its implementation because only matrix multiplication is used to compute the full attentional weight of the sequence. The multiheaded self-attention mechanism (MSSAM) [16] is a representative model for parallelization.

However, like the SASRec [17] and TiSASRec [19] models, which use the same positional embedding in each attention head, the above methods waste capacity learning overlapping patterns, and their simple embedding schemes can lead to information bottlenecks. The interval-based self-attention mechanism also restricts temporal information to a single embedding scheme, so the information in the timestamps is not adequately exploited. Additionally, SASRec [17] and TiSASRec [19] fall short in modelling both long-term and short-term interests: they do not emphasize current interests and are limited in characterizing users' long-term preferences and short-term needs, both of which must be considered for better performance [27].

3 Preliminaries

3.1 Attention Mechanism

In the field of sequence recommendation, an attention mechanism assigns different weights to the behaviours in a sequence, emphasizing those that are helpful to the result and downplaying unhelpful ones. The Transformer [16] uses scaled dot-product attention, defined in Eq. (1). Here, the scaling factor \(\sqrt{d}\), where d is the dimension of the keys, prevents the inner product from becoming too large.

$$\begin{aligned} {Attention}(Q,K,V)=\text {softmax}\left( \frac{QK^{T}}{\sqrt{d} } \right) V \end{aligned}$$
(1)

In Eq. (1), Q, K, and V represent the query, key and value, respectively. In self-attention [28], the three inputs usually use the same object.

Three vector sequences are generated for an input sequence \(X=\left[ x_{1},...,x_{N} \right] \in \mathbb {R}^{D_{x}\times N}\), as shown in Eq. (2):

$$\begin{aligned} \begin{aligned}Q&= W_{Q}X \in \mathbb {R}^{D_{k}\times N} \\ K&= W_{K}X \in \mathbb {R}^{D_{k}\times N} \\ V&= W_{V}X \in \mathbb {R}^{D_{v}\times N} \end{aligned} \end{aligned}$$
(2)

In Eq. (2), \(W_{Q} \in \mathbb {R}^{D_{k}\times D_{x}}\), \(W_{K} \in \mathbb {R}^{D_{k}\times D_{x}}\) and \(W_{V} \in \mathbb {R}^{D_{v}\times D_{x}}\) are learnable projection matrices. \(Q=\left[ q_{1},...,q_{N} \right]\) is the query sequence, representing the information currently being queried; it is typically derived from the model's hidden state or previous outputs. \(K=\left[ k_{1},...,k_{N} \right]\) is the key sequence, generated from the input's feature representation. \(V=\left[ v_{1},...,v_{N} \right]\) is the value sequence, which carries the actual input representation.

The output sequence is \(H=\left[ h_{1},...h_{N} \right] \in \mathbb {R}^{D_{v}\times N}\), and the scaled dot product is also used as the attention scoring function; the output vector sequence is shown in Eq. (3).

$$\begin{aligned} H=V\,\text {softmax}\left( \frac{K^{T}Q}{\sqrt{D_{k}} } \right) \end{aligned}$$
(3)

H denotes the weighted combination computed from the input sequence and the attention weights (the softmax is applied column-wise over the scores), and its interpretation depends on the specific application context and model design. In the context of recommendation systems, the output sequence corresponds to the recommended results that are likely to be of interest to the user.
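To make Eqs. (2) and (3) concrete, here is a minimal NumPy sketch under the column-vector convention used above; the projection matrices and dimensions are illustrative toy values, not settings from the model.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for column-vector inputs.

    X: (D_x, N) input, one column per position.
    W_Q, W_K: (D_k, D_x); W_V: (D_v, D_x).
    Returns H: (D_v, N) as in Eq. (3).
    """
    Q, K, V = W_Q @ X, W_K @ X, W_V @ X            # Eq. (2)
    D_k = K.shape[0]
    A = softmax(K.T @ Q / np.sqrt(D_k), axis=0)    # (N, N); column j weights the keys for query j
    return V @ A                                   # weighted combination, Eq. (3)

# Toy usage: N=4 positions, D_x=8, D_k=D_v=6.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
H = self_attention(X, rng.normal(size=(6, 8)),
                   rng.normal(size=(6, 8)), rng.normal(size=(6, 8)))
print(H.shape)  # (6, 4)
```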

3.2 Multihead Attention Mechanism

The multihead attention mechanism expands a query vector into multiple queries so that multiple sets of information can be selected from the input in parallel, where each attention head focuses on a different part of the input sequence. A representation of the multihead attention mechanism is shown in Eq. (4), and its schematic diagram is shown in Fig. 2. Here, h is the number of heads of multihead attention.

$$\begin{aligned} \begin{aligned} {MultiHead}(Q,K,V)&= {Concat}({head}_{1}, {head}_{2},...,{head}_{h})W_{O} \\ {\text {where}}\quad {head}_{i}&={Attention}(QW_{Q}^{i},KW_{K}^{i},VW_{V}^{i} ) \end{aligned} \end{aligned}$$
(4)
Fig. 2 Multihead attention mechanism

For the self-attention mechanism, the input sequence is multiplied by the matrices \(W_{Q}\), \(W_{K}\), and \(W_{V}\) to obtain the three matrices Q, K, and V, respectively; the attention matrix is then computed to obtain the output. As shown in Fig. 2, the multihead attention mechanism projects the input into h lower-dimensional subspaces. Each subspace corresponds to a head and is mapped to a query, key and value, yielding the groups \(q_{i}\), \(k_{i}\) and \(v_{i}\), respectively. A scaled dot-product operation is performed for each group, and the results are finally merged with Concat.
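Extending the sketch above (and reusing its `self_attention` helper, `rng` and `X`), the following toy illustration of Eq. (4) gives each head its own projections into a d/h-dimensional subspace and concatenates the results; the head count and sizes are assumptions for illustration.

```python
def multihead_attention(X, heads, W_O):
    """Multihead attention of Eq. (4), built on self_attention above.

    heads: list of h (W_Q, W_K, W_V) triples, one per head, each
           projecting into a d/h-dimensional subspace.
    W_O:   output projection applied to the concatenated heads.
    """
    outs = [self_attention(X, W_Q, W_K, W_V) for (W_Q, W_K, W_V) in heads]
    return W_O @ np.concatenate(outs, axis=0)  # Concat along the feature axis

# Toy usage: h=2 heads of size d/h=3 over the same X as above.
heads = [(rng.normal(size=(3, 8)), rng.normal(size=(3, 8)),
          rng.normal(size=(3, 8))) for _ in range(2)]
print(multihead_attention(X, heads, rng.normal(size=(6, 6))).shape)  # (6, 4)
```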

4 Proposed Model

4.1 Problem Description

The symbol definitions involved in the sequence recommendation problem are shown in Table 1. The sequence of a user's historical interactions is defined as the behavioural actions from time steps 1 through T. Figure 3 shows an example of a historical interaction sequence for user \(u_{y}\), where \(y\in \left[ 1,n \right]\), \(x_{i}\in \left[ 1,m \right]\), and len represents the length of the historical sequence. The objective of sequence recommendation is to obtain the top K results of the personalized ranking of m items according to the preferences of any user at any time point.

Table 1 Symbol definition
Fig. 3 Sequence of historical behaviours for user \(u_{y}\)

4.2 Long and Short Sequence Sampling

Different users u have behavioural sequences of different lengths. Shorter sequences can therefore be filled with padding items from the left so that their lengths reach a uniform threshold len. For long sequences exceeding len, the SASRec model [17] keeps only the latest len behaviours, since the training time increases with length. However, this is suboptimal: while recent behaviour is important, earlier behaviour can also help in finding the next interaction. Inspired by the Caser model [13], which allows skip behaviours when predicting the next action, a sampling strategy for long sequences is designed.

For each user \(u \in U\) with historical behaviour records \(S^{u}=\left\{ S_{1}^{u}, S_{2}^{u}, \ldots , S_{T}^{u}\right\}\), long sequence sampling is required when \(T>len\). The conventional approach is uniform sampling: a start index j is drawn uniformly, and the subsequence \(S^{u}=\left\{ S_{j}^{u}, S_{j+1}^{u}, \ldots , S_{z}^{u}\right\}\) is taken, where \(z=\min (T, j+len-1)\). While this approach can handle a long sequence, it ignores the importance of recent behaviour. Therefore, a sampling probability \(p_{S}\) is used to balance the two sampling situations, as shown in Eq. (5).

$$\begin{aligned} S^{u}=\left\{ \begin{array}{ll}\left\{ S_{j}^{u}, S_{j+1}^{u}, \ldots , S_{j+len-1}^{u}\right\} , &{} \text{ with probability } p_{S} \\ \left\{ S_{T-len+1}^{u}, \ldots , S_{T}^{u}\right\} , &{} \text{ with probability } 1-p_{S}\end{array}\right. \end{aligned}$$
(5)

Here, \(p_{S}\) is a hyperparameter giving the probability of choosing the uniform-sampling strategy. j is the initial index, taking values in \(\left[ 1,T-len \right]\).

Figure 4 is an example of long sequence sampling, assuming \(T=8\) and \(len=6\). In this case, because \(T>len\), strategy I (uniform sampling) is executed with probability \(p_{S}\), and strategy II (the latest len behaviours are selected) is executed with probability \(1-p_{S}\). The circles represent interactions, where green represents unselected interactions and yellow represents selected interactions.

Fig. 4 Sampling strategy for a long sequence (green for the original sequence, yellow for the sampled sequence)
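A minimal sketch of the sampling rule of Eq. (5), including the left-padding of short sequences described above; the padding id 0 and the function name are assumptions for illustration.

```python
import random

def sample_sequence(seq, length, p_s, rng=random.Random(42)):
    """Long-sequence sampling per Eq. (5).

    With probability p_s, take a window of `length` items starting at a
    uniformly chosen index (strategy I); otherwise take the most recent
    `length` items (strategy II). Shorter sequences are left-padded with
    the assumed padding id 0.
    """
    T = len(seq)
    if T <= length:
        return [0] * (length - T) + list(seq)   # pad from the left
    if rng.random() < p_s:
        j = rng.randint(0, T - length)          # strategy I: 0-based start in [0, T-len]
        return list(seq[j:j + length])
    return list(seq[T - length:])               # strategy II: latest behaviours

# Toy usage matching Fig. 4: T=8, len=6.
print(sample_sequence([1, 2, 3, 4, 5, 6, 7, 8], length=6, p_s=0.2))
```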

4.3 MSMT-LSI Model Structure

Fig. 5 Model structure of MSMT-LSI for balancing long- and short-term benefits

The MSMT-LSI sequential recommendation model is based on a self-attention mechanism and has advantages over traditional MC, CNN, and RNN models in modelling long-term dependencies, attending to global information, and modelling context. Compared with models such as TiSASRec, in addition to optimizing the sampling strategy, MSMT-LSI uses multiple forms of temporal embedding to extract a variety of patterns from a user's sequential behaviour, as shown in Fig. 5, while capturing the user's long-term and short-term preferences separately through two multihead self-attention networks and ultimately integrating the long- and short-term representations. These strategies give MSMT-LSI advantages in parallel processing and in capturing dynamic changes in user interests.

4.3.1 Input Layer

In the input layer, the extraction of user behaviour sequences involves two dimensions. First, we consider the user’s long-term sequence \(Lo^{u} =\left\{ i_{1},i_{2},...,i_{T} \right\}\), encompassing all their behaviours over a specific period. Second, we capture the user’s shorter sequences \(Sh^{u} =\left\{ i_{t_s},i_{t_{s+1}},...,i_{T} \right\}\), reflecting their recent behaviours, with the length determined by the value of a specific time scale \(t_s\). This approach enables our model to simultaneously uncover the user’s long-term stable interests and account for any shifts in their short-term interests, thus obtaining a fusion of long- and short-term interests.

Furthermore, to impose constraints and standardize the lengths of the user and behaviour sequences inputted into the model, we adopt the long- and short-sequence sampling strategy mentioned in Sect. 4.2. By setting different length thresholds, we obtain two behavioural sequences: a long sequence \(sl =\left\{ s_{1},s_{2},...,s_{mlen} \right\}\) and a short sequence \(ss =\left\{ s_{1},s_{2},...,s_{nlen} \right\}\).

4.3.2 Embedding Layer

The item embedding matrix \(\textbf{M}^{I} \in \mathbb {R}^{|I |\times d}\) is first created, with the padding item mapped to a constant zero vector, where d is the embedding dimension. For each sequence, the embedding vector of each item is looked up, yielding the matrix \(\textbf{E}^{I} \in \mathbb {R}^{l e n \times d}\), where \(\textbf{E}_{i}^{I}=\textbf{M}_{S_{i}}^{I}\). We thus obtain matrices \({\textbf {E}} ^{IL}\in \mathbb {R}^{mlen \times d}\) and \({\textbf {E}} ^{IS}\in \mathbb {R}^{nlen \times d}\) for the two sequences. For the timestamps \(t=\left\{ t_{1},t_{2},...,t_{mlen} \right\}\) of the long sequence sl, we also need to consider multiple temporal embeddings. A single embedding approach, as in previous studies, may not fully capture user behavioural patterns; therefore, we use a different encoding function for each embedding.

The first method is the most common positional embedding: \({\textbf {E}} ^{P}\) is obtained from a positional embedding matrix \({\textbf {M}} ^{P}\in \mathbb {R}^{mlen \times d}\), where the index in the time array is used instead of the timestamp value. The second is day embedding, which constructs the embedding from the day of each timestamp, balancing matrix size against effectiveness. \({\textbf {E}} ^{D}\) is obtained from a learnable embedding matrix \({\textbf {M}} ^{D}\in \mathbb {R}^{|D |\times d}\), where \(|D |\) denotes the number of days; this approach prevents the model from being trapped by identical or similar days. The third method uses time difference information to encode the interaction relationship. First, we define the time difference matrix \({\textbf {T}}\in \mathbb {R}^{mlen \times mlen}\); its elements are calculated as shown in Eq. (6).

$$\begin{aligned} dis_{ab} =\frac{({\textbf {t}}_{a} - {\textbf {t}}_{b})}{\tau } \end{aligned}$$
(6)

In Eq. (6), a and b index two historical items in a behavioural sequence, and \(\tau\) is an adjustable time-unit parameter that scales the difference.

There are three ways to encode the time difference matrix \({\textbf {T}}\in \mathbb {R}^{mlen \times mlen}\). The sin encoding turns \(dis_{ab}\) into a hidden vector \(\overrightarrow{\theta }_{ab}\), as shown in Eq. (7), and is used to capture periodic events in sequential behaviour; \(\overrightarrow{\theta }_{ab,c}\) denotes the c-th component of \(\overrightarrow{\theta }_{ab}\), and freq is an adjustable parameter. The exp and log encodings convert \(dis_{ab}\) into hidden vectors \(\overrightarrow{e}_{ab}\) and \(\overrightarrow{l}_{ab}\), as shown in Eqs. (8) and (9), respectively. They are designed to increase diversity: the larger the time interval, the closer exp is to zero and the larger log grows.

$$\begin{aligned} \overrightarrow{\theta }_{ab,2c} = \sin \left( \frac{dis_{ab} }{freq^{\frac{2c}{d} }} \right) , \quad \overrightarrow{\theta }_{ab,2c+1} = \cos \left( \frac{dis_{ab} }{freq^{\frac{2c}{d} }} \right) \end{aligned}$$
(7)
$$\begin{aligned} \overrightarrow{e}_{ab,c} = \exp \left( \frac{-|dis_{ab}|}{freq^{\frac{c}{d} }} \right) \end{aligned}$$
(8)
$$\begin{aligned} \overrightarrow{l}_{ab,c} = \log \left( 1+\frac{|dis_{ab}|}{freq^{\frac{c}{d} }} \right) \end{aligned}$$
(9)

Stacking these vectors yields the three embeddings \({\textbf {E}} ^{S}\), \({\textbf {E}} ^{E}\) and \({\textbf {E}} ^{L}\).
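A sketch of Eqs. (6)-(9) follows; τ (here one day in seconds) and freq are illustrative choices, and d is assumed even so that the sine/cosine pairs of Eq. (7) interleave.

```python
import numpy as np

def time_difference_matrix(t, tau=86400.0):
    """Pairwise differences dis_ab = (t_a - t_b) / tau, Eq. (6)."""
    t = np.asarray(t, dtype=np.float64)
    return (t[:, None] - t[None, :]) / tau

def sin_encoding(dis, d, freq=10000.0):
    """Eq. (7): sine on even dimensions, cosine on odd ones."""
    scale = freq ** (2 * np.arange(d // 2) / d)
    angles = dis[..., None] / scale            # (mlen, mlen, d/2)
    enc = np.empty(dis.shape + (d,))
    enc[..., 0::2] = np.sin(angles)
    enc[..., 1::2] = np.cos(angles)
    return enc

def exp_encoding(dis, d, freq=10000.0):
    """Eq. (8): decays towards zero as the interval grows."""
    return np.exp(-np.abs(dis)[..., None] / freq ** (np.arange(d) / d))

def log_encoding(dis, d, freq=10000.0):
    """Eq. (9): grows slowly with the interval."""
    return np.log1p(np.abs(dis)[..., None] / freq ** (np.arange(d) / d))

# Toy usage: four timestamps one day apart, d=8.
dis = time_difference_matrix([0, 86400, 172800, 259200])
print(sin_encoding(dis, 8).shape, exp_encoding(dis, 8).shape)  # (4, 4, 8) twice
```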

4.3.3 Self-Attention Network

The overall schematic of the self-attention network is shown in Fig. 6.

Fig. 6 Schematic diagram of the self-attention network

We next take the long sequence \({\textbf {E}}^{IL}\) as an example. The multihead self-attention mechanism of the Transformer [16] has also shown good results in the sequence recommendation domain. The traditional self-attention-based approach uses the sum of the item embedding \({\textbf {E}}^{IL}\) and the position embedding \({\textbf {E}}^{P}\), a simple absolute position encoding [29, 30], as shown in Fig. 7a; this approach is used for the short sequence \({\textbf {E}}^{IS}\). Several prior studies have improved upon it [31, 32] by adding relative position embedding (the difference in array position indices), as shown in Fig. 7b; the global content and positional deviation are also considered, and \({\textbf {E}}^{S}\), \({\textbf {E}}^{E}\), and \({\textbf {E}}^{L}\) are used in this way. In addition, for absolute embedding, better results are obtained by separating the item embedding from the positional embedding and combining them with the query and key separately instead of simply adding them, as shown in Fig. 7c; this scheme is used for \({\textbf {E}}^{P}\) and \({\textbf {E}}^{D}\).

To allow the model to combine information from different subspaces at the same time, we use the multihead self-attention mechanism: h independent self-attention heads with different parameters process the information in parallel, and the outputs of all the heads are concatenated to obtain the result, as shown in Fig. 8. The dimension of each head is \(\frac{d}{h}\), while the number of heads h and the embedding type are adjustable according to the characteristics of the dataset.

Fig. 7 Schematic diagram of self-attentive structural embedding

Fig. 8 Schematic diagram of a multihead self-attention network

After the above multihead self-attention layer (the MHSA layer), for the input sequence \({\textbf {E}}^{IL}=\left\{ m_{S_{1}},m_{S_{2}},...,m_{S_{mlen}} \right\}\), the output of the self-attention network is defined as \({\textbf {Z}}^{L}=\left\{ z_{1}^{L},z_{2}^{L},...,z_{mlen}^{L} \right\}\), and a two-layer feed-forward network (the FFN layer) is applied after each multihead self-attention module, as shown in Eq. (10).

$$\begin{aligned} FFN(z_{i}^{L}) = \text {ReLU}(z_{i}^{L}W_{1}+b_{1})W_{2}+b_{2} \end{aligned}$$
(10)

Layer normalization, residual connection and dropout are employed to suppress problems such as overfitting, as shown in Eq. (11). After stacking multiple self-attention modules, the lth module is as shown in Eq. (12).

$$\begin{aligned} {\textbf {Z}}_{i}^{L} =z_{i}^{L}+Dropout\left( FFN\left( LayerNorm\left( z_{i}^{L} \right) \right) \right) \end{aligned}$$
(11)
$$\begin{aligned} {\textbf {S}}^{L(l)}&= MHSA\left( {\textbf {Z}}^{L(l-1)} \right) \\ {\textbf {Z}}^{L(l)}&= FFN\left( {\textbf {S}}^{L(l)} \right) \end{aligned}$$
(12)

Similarly, this process is performed for a short sequence \({\textbf {E}}^{IS}\), and its output \({\textbf {Z}}^{S}=\left\{ z_{1}^{S},z_{2}^{S},...,z_{nlen}^{S} \right\}\) is obtained through its self-attention network.
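The sub-layer computation of Eqs. (10) and (11) can be sketched as follows (row-vector convention for simplicity); the FFN width, dropout rate and toy shapes are assumptions, and the MHSA layer itself follows the attention sketches in Sect. 3.

```python
import numpy as np

def layer_norm(x, eps=1e-8):
    """Layer normalization over the feature (last) axis."""
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(
        x.var(axis=-1, keepdims=True) + eps)

def ffn(z, W1, b1, W2, b2):
    """Two-layer feed-forward network of Eq. (10), ReLU in between."""
    return np.maximum(z @ W1 + b1, 0.0) @ W2 + b2

def sublayer(z, W1, b1, W2, b2, drop=0.2, rng=np.random.default_rng(0)):
    """Residual sub-layer of Eq. (11): LayerNorm -> FFN -> Dropout -> add."""
    h = ffn(layer_norm(z), W1, b1, W2, b2)
    mask = rng.random(h.shape) > drop          # inverted dropout at train time
    return z + h * mask / (1.0 - drop)

# Toy usage: mlen=5 positions, d=8, FFN hidden width 32.
rng = np.random.default_rng(1)
z = rng.normal(size=(5, 8))
out = sublayer(z, rng.normal(size=(8, 32)), np.zeros(32),
               rng.normal(size=(32, 8)), np.zeros(8))
print(out.shape)  # (5, 8)
```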

4.3.4 Forecasting Layer

The prediction layer calculates the inner product of the information previously extracted from the self-attention module with all the item embedding vectors to represent the correlation between them and determines the most relevant top K items (those with the highest correlation scores) as the result. The correlation score for long sequences is calculated as shown in Eq. (13):

$$\begin{aligned} R_{i,t}^{L}={\textbf {Z}}_{t}^{L}{\textbf {M}}_{i}^{I} \end{aligned}$$
(13)

In Eq. (13), \({\textbf {M}}_{i}^{I} \in \mathbb {R}^{d}\) is the embedding vector of item i.

The same operation is performed for short sequences, yielding \({\textbf {R}}_{i,t}^{S}\). To predict the next interaction, we set t to mlen for the long sequence and to nlen for the short sequence. Finally, the long-term and short-term correlation scores are fused (various fusion methods can be used, such as sum, mean, maximum and minimum); with sum fusion, the next-interaction score is calculated as shown in Eq. (14).

$$\begin{aligned} y_{i}=\left( {\textbf {Z}}_{mlen}^{L}+{\textbf {Z}}_{nlen}^{S} \right) {\textbf {M}}_{i}^{I} \end{aligned}$$
(14)

In Eq. (14), \(y_{i}\) denotes the score of the i-th candidate item in the set of all items; \({\textbf {Z}}_{mlen}^{L}\) and \({\textbf {Z}}_{nlen}^{S}\) denote the mlen-th and nlen-th rows of the two matrices, respectively.

We obtain a vector \({\textbf {y}}\) of scores for all possible items and sort them in descending order, where the top K items are the ones we recommend.
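A compact sketch of the prediction layer, Eqs. (13) and (14), parameterized by the four fusion strategies examined later in Sect. 5.4; the function name and toy dimensions are assumptions.

```python
import numpy as np

def recommend_top_k(z_long, z_short, item_emb, k=10, fusion="sum"):
    """Fuse the final long/short hidden states and rank all items.

    z_long, z_short: (d,) rows mlen and nlen of Z^L and Z^S.
    item_emb:        (num_items, d) item embedding matrix M^I.
    Returns indices of the top-k items by fused score, Eq. (14).
    """
    fused = {"sum": z_long + z_short,
             "mean": (z_long + z_short) / 2.0,
             "max": np.maximum(z_long, z_short),
             "min": np.minimum(z_long, z_short)}[fusion]
    scores = item_emb @ fused                  # y_i for every candidate item
    return np.argsort(-scores)[:k]             # descending by score

# Toy usage: 1000 candidate items, d=50.
rng = np.random.default_rng(2)
print(recommend_top_k(rng.normal(size=50), rng.normal(size=50),
                      rng.normal(size=(1000, 50))))
```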

4.4 Recommendation Process Based on MSMT-LSI

A paper recommendation process is used to demonstrate the interpretability of MSMT-LSI. As shown in Fig. 9, the relevant literature data are first obtained from the system log or the data centre of the relevant website, and the data are then preprocessed, including filtering with self-defined conditions, sorting by timestamp and converting to appropriate data types. The data are subsequently input into the MSMT-LSI model, which outputs the top 10 recommended papers.

Fig. 9 Process of literature recommendation

Table 2 Top 10 recommendation results for five users

The CiteULike dataset [33] is adopted to preliminarily demonstrate the recommendation ability of the MSMT-LSI model. The data are merged and extracted to obtain tuples. Items and users with few operations are filtered out to obtain \(H_{filter}=\left\langle uid,pid,pt \right\rangle\), which is then sorted by timestamp and converted to obtain \(H_{sort}=\left\langle uid,pid,pt \right\rangle\). Finally, two sequences are obtained, corresponding to the paper_id sequence and the read_paper_time sequence in the figure, as the input of the model. The model predicts the IDs of the top 10 papers that the current user is most likely to be interested in next.

In addition, Table 2 shows a comparison of the preliminary prediction results between our model and the SASRec model [17]. Among the top 10 recommended results for five users, SASRec achieved a hit rate of 2/5, while MSMT-LSI achieved a hit rate of 3/5. In the case of user 9, the results from MSMT-LSI had a higher overall ranking in terms of relevance, which indicates that MSMT-LSI exhibits superior performance in terms of accuracy and sequentiality.

5 Experimental Results and Analysis

5.1 Datasets

To verify the effectiveness of the MSMT-LSI model, five real-world datasets were selected for the experiments. MovieLens-1M [34] is a benchmark dataset in the recommendation field and contains users' evaluation scores for movies together with temporal information. Amazon Beauty [35] and Amazon Games [35] are highly sparse product-rating datasets from Amazon.com. The Gowalla dataset [36] was obtained from a location-based social networking site and contains user location check-ins between February 2009 and October 2010. The MovieLens-10M dataset [34] was also selected.

All of these datasets contain tuples of the form (user, item, rating, timestamp). The sparsity of the five datasets is shown in Fig. 10. For example, in the Beauty and Games datasets, the proportion of sequences longer than 50 is less than 5%; we refer to datasets with such characteristics as sparse. The Beauty and Games datasets are much sparser than MovieLens-1M, and their sequences are also much shorter. The MovieLens-10M and Gowalla datasets are dense, with much longer sequences; thus, the five selected datasets cover most cases in terms of density and sparsity.

Fig. 10 Dataset sparsity analysis

To illustrate that users' interests drift over time, the latest MovieLens dataset (ml-latest-small) is analyzed, as shown in Fig. 11. From 2011 to 2013, the scores of all kinds of movies were high; however, they declined by 2017, indicating that users' interests change with time.

Fig. 11 Time analysis of the latest MovieLens dataset

During the experiments, the data were preprocessed as described above. MovieLens-1M is suitable for long-sequence sampling, whereas Amazon Beauty and Games are not, because their sequences are too short. For all the datasets, we treat the ratings as implicit feedback. For the MovieLens-1M, Amazon Beauty, and Amazon Games datasets, items and users with fewer than 5 operations are filtered out; for the MovieLens-10M and Gowalla datasets, items and users with fewer than 20 operations are filtered out. The data are then processed and sorted in timestamp order. Since the sequences are sorted, we use the standard leave-one-out method to divide each dataset. The statistics of the preprocessed datasets are shown in Table 3, where #Mean and #Max refer to the average and maximum sequence lengths, respectively.
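For concreteness, the sketch below filters, sorts and splits the data under the common SASRec-style leave-one-out protocol (last item for testing, second-to-last for validation), which we assume here; it performs a single filtering pass for simplicity.

```python
from collections import defaultdict

def leave_one_out(interactions, min_ops=5):
    """Filter sparse users/items, sort by timestamp, and split leave-one-out.

    interactions: iterable of (user, item, timestamp) tuples.
    Assumes the SASRec-style protocol: per user, the last item is the test
    target, the second-to-last is for validation, the rest train the model.
    """
    by_user = defaultdict(list)
    item_count = defaultdict(int)
    for u, i, t in interactions:
        by_user[u].append((t, i))
        item_count[i] += 1
    train, valid, test = {}, {}, {}
    for u, events in by_user.items():
        # keep items seen at least min_ops times, in timestamp order
        seq = [i for t, i in sorted(events) if item_count[i] >= min_ops]
        if len(seq) < min_ops:      # drop inactive users
            continue
        train[u], valid[u], test[u] = seq[:-2], seq[-2], seq[-1]
    return train, valid, test

# Toy usage: three users with the same five interactions.
data = [(u, i, t) for u in ("a", "b", "c")
        for t, i in enumerate([1, 2, 3, 1, 2])]
tr, va, te = leave_one_out(data, min_ops=3)
print(tr["a"], va["a"], te["a"])  # [1, 2, 3] 1 2
```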

Table 3 Statistics for different datasets

5.2 Evaluation Metrics

In the context of Top-K recommendation problems, two categories of evaluation metrics are commonly used. The first category focuses on assessing the accuracy of the recommendation results, while the second category focuses on evaluating the order or ranking of the recommendation results.

5.2.1 Hit Ratio

The hit ratio (HR) [17] is a commonly used metric for measuring recall. It quantifies the proportion of successful recommendations made by the system. The calculation of HR is described by Eq. (15).

$$\begin{aligned} HR@K=\frac{NumberOfHits@K}{GT} \end{aligned}$$
(15)

In Eq. (15), NumberOfHits@K is the number of test items that appear in the top-K recommendation lists, summed over all users, and GT is the total number of ground-truth test items across all users.

5.2.2 Normalized Discounted Cumulative Gain

The normalized discounted cumulative gain (NDCG) [17, 37] is a commonly used evaluation metric that considers the order or ranking of the recommended results. It places the items that the user prefers higher in the ranking. The calculation method for the NDCG is described by Eqs. (16)–(18).

$$\begin{aligned} DCG@K=\sum _{i=1}^{K}\frac{2^{rel_{i} }-1 }{\log _{2}{\left( i+1 \right) } } \end{aligned}$$
(16)
$$\begin{aligned} IDCG@K=\sum _{i=1}^{K}\frac{1 }{\log _{2}{\left( i+1 \right) } } \end{aligned}$$
(17)
$$\begin{aligned} NDCG@K=\frac{DCG@K}{IDCG@K} \end{aligned}$$
(18)

In Eq. (16), \(rel_{i}\) represents the relevance (rating) of the item at position i of the top-K list for the user. In Eqs. (17) and (18), IDCG@K represents the ideal DCG@K value.
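With a single held-out item per user (rel = 1), Eqs. (15)-(18) reduce to the simple forms below; the 0-based ranks are assumed to come from the negative-sampling ranking protocol described in Sect. 5.3.

```python
import numpy as np

def hr_at_k(ranks, k=10):
    """HR@K of Eq. (15) with one held-out item per user:
    the fraction of users whose target ranks within the top K."""
    return float(np.mean(np.asarray(ranks) < k))

def ndcg_at_k(ranks, k=10):
    """NDCG@K of Eqs. (16)-(18) with a single relevant item, so
    DCG = 1/log2(rank+2) for a 0-based rank and IDCG = 1."""
    ranks = np.asarray(ranks)
    gains = np.where(ranks < k, 1.0 / np.log2(ranks + 2), 0.0)
    return float(np.mean(gains))

# Toy usage: 0-based ranks of the target item for five test users.
ranks = [0, 3, 12, 1, 57]
print(hr_at_k(ranks), ndcg_at_k(ranks))  # 0.6 and roughly 0.41
```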

5.3 Model and Experiment Setup

The pairwise objective function of BPR optimization [2] is chosen as the loss function, and negative samples are generated in equal proportion to positive samples. The model is trained by minimizing Eq. (19), which encodes that, for user u, the preference for an interacted item i should be greater than that for a non-interacted item j.

$$\begin{aligned} \sum _{i=2}^{mlen}\sum _{j\notin S^{u}}-\ln \sigma \left( R_{i,t}^{L}-R_{j,t}^{L} \right) +\sum _{\tilde{i}=2}^{nlen}\sum _{\tilde{j} \notin S^{u}}-\ln \sigma \left( R_{\tilde{i},t}^{S}-R_{\tilde{j},t}^{S} \right) +\lambda \left\| \theta \right\| ^{2} \end{aligned}$$
(19)

In Eq. (19), \(\theta\) represents the set of model parameters, and \(\sigma (x)=\frac{1}{1+\text {e}^{-x}}\) is the sigmoid function, which maps the score differences into the range (0, 1).
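A minimal sketch of one branch of Eq. (19); in MSMT-LSI, the total loss sums this term over the long- and short-sequence branches. The numerically stable logaddexp form equals \(-\ln \sigma\) of the score difference.

```python
import numpy as np

def bpr_branch_loss(pos_scores, neg_scores, params, lam=1e-4):
    """BPR pairwise loss for one sequence branch of Eq. (19).

    pos_scores: scores R of the interacted items at each step.
    neg_scores: scores of the sampled negative items (same shape).
    params:     parameter arrays for the L2 term lambda * ||theta||^2.
    """
    diff = np.asarray(pos_scores) - np.asarray(neg_scores)
    nll = np.logaddexp(0.0, -diff).sum()       # = -sum ln sigmoid(diff)
    reg = lam * sum((p ** 2).sum() for p in params)
    return nll + reg

# Toy usage: 5 steps, one negative per positive, one weight matrix.
rng = np.random.default_rng(3)
print(bpr_branch_loss(rng.normal(size=5) + 1.0, rng.normal(size=5),
                      [rng.normal(size=(8, 8))]))
```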

During testing, to facilitate comparison with SASRec [17], 100 negative samples were generated for each positive sample; the positive item was then ranked among these candidates for evaluation.

The experimental environment was the Ubuntu 19.10 operating system with an RTX 2080 Ti graphics card and an i7-9700K CPU; the code was written in Python 3.6 on the TensorFlow framework. The adaptive moment estimation (Adam) optimizer [38] was used to train the MSMT-LSI model. The time scale \(t_s\) was set to one day, meaning that the short sequences cover the last day. The dropout rate was set to 0.2 or 0.5, the learning rate to 0.001, and the batch size to 512; training was terminated if there was no performance improvement after 20 epochs. The number of stacked self-attention modules l was set to 2. The other parameters, including the embedding dimension d, the long sequence sampling probability \(p_{S}\), the maximum sequence lengths mlen and nlen, and the fusion strategy, are given in Sect. 5.4. All the parameters are summarized in Table 4.

Table 4 Parameter setting

AttRec [15], SASRec [17] and TiSASRec [19] were chosen for the comparative experiments. AttRec [15] models long- and short-term interests, and TiSASRec [19] considers time interval information for recommendation. All of the above models are based on the self-attention mechanism, so they are comparable. The parameters of the contrasting models are set according to the default values suggested in the corresponding literature mentioned above.

5.4 Parameter Sensitivity Analysis

To ensure the model’s generalization performance, we conducted a parameter sensitivity analysis on MovieLens-1M, which has a greater proportion of long sequences. The analysis included examining the sensitivity of the model to the embedding dimension d and long sequence sampling probability \(p_{S}\). The results are shown in Figs. 12 and 13.

Based on the results in Fig. 12, increasing d improves the model's performance. However, a larger d means more parameters in the embedding layer and slower training. Balancing performance against training time, we set d to 50 for the subsequent comparative experiments.

Fig. 12 Parameter sensitivity analysis of the embedding dimension d

Fig. 13 Parameter sensitivity analysis of the long sequence sampling probability \(p_{S}\)

We also analyzed \(p_{S}\) on the MovieLens-1M dataset; the model's performance is shown in Fig. 13. Based on the NDCG@10 and HR@10 results, we set \(p_{S}\) to 0.2 for the subsequent comparative experiments on dense datasets.

We then performed parameter sensitivity analyses on the MovieLens-10M and Gowalla datasets, which have larger sample sizes, covering the maximum sequence lengths mlen and nlen and the fusion strategies (Sum, Mean, Max and Min).

First, for the maximum sequence lengths mlen and nlen, we performed a grid search from 100 to 500, that is, \(mlen=\left\{ 100,200,300,400,500\right\}\), and a grid search from 10 to 100, that is, \(nlen=\left\{ 10,30,50,100\right\}\). The performances of NDCG@10 on the two datasets are shown in Tables 5 and 6. The first row and column represent the values of mlen and nlen, respectively.

Table 5 NDCG@10 on MovieLens-10M as mlen and nlen change
Table 6 NDCG@10 on Gowalla as mlen and nlen change

From Tables 5 and 6, as mlen and nlen increase, the model's performance improves to different degrees but saturates beyond 400 and 50, respectively. On the MovieLens-10M dataset, NDCG@10 reaches 0.7291 and 0.7267 when mlen is fixed at 400 and nlen is set to 50 and 100, respectively, and reaches 0.7291 and 0.7259 when nlen is fixed at 50 and mlen is set to 400 and 500, respectively. Performance degrades slightly, probably because the information in the sequences has already been mined as fully as possible, so increasing the length brings little gain. Moreover, as the amount of sequence data increases, the training time of the model necessarily grows. In Table 6, when mlen is set to 200 and nlen to 50, the model takes 2165.77 s to train for one epoch; when mlen is increased to 400, the training time rises to 3956.3 s. This observation further supports the above statement. Therefore, mlen=200 and nlen=50 are chosen in the subsequent experiments.

The following experiments are conducted for the fusion strategy. We keep the other parameter settings as described in Sect. 5.3, and the fusion strategy uses Sum, Mean, Max, and Min; the performances of HR@10 and NDCG@10 on all the datasets are shown in Table 7.

Table 7 Performance on five datasets with different fusion strategies

Table 7 shows that the Sum strategy works best on our selected datasets, both sparse and dense, while the Min strategy performs the worst. For example, on the MovieLens-1M dataset, HR@10 and NDCG@10 are 0.8302 and 0.5987, respectively, with Sum, but only 0.8006 and 0.5682 with Min. On the Amazon Beauty dataset, HR@10 and NDCG@10 are 0.4852 and 0.3321 with Sum versus 0.4391 and 0.2974 with Min, showing that Min is also less effective than Sum on sparse datasets. This is because Sum retains more information about long- and short-term interests, whereas Min keeps less information of poorer quality. Additionally, the Mean strategy performs well on dense datasets but falls behind Max on sparse ones. For example, on the MovieLens-10M dataset, HR@10 and NDCG@10 are 0.9311 and 0.7265 with Sum and 0.9323 and 0.7266 with Mean, so Sum and Mean are comparable on this dense dataset. However, on the Amazon Games dataset, HR@10 and NDCG@10 are 0.7513 and 0.5466 with Sum but only 0.7211 and 0.5064 with Mean, indicating that Mean does not perform as well on sparse data. This may be because Max attends more to whichever of the long-term and short-term interests dominates the data, which is especially important when the data are sparse. In summary, we choose Sum as the fusion strategy in the subsequent experiments.

After the previous parameter sensitivity analysis, the embedding dimension was set to 50. The maximum sequence length was set to 200 for long sequences and 50 for short sequences, and the fusion strategy was set to Sum.

5.5 Performances of Different Models

The performance of MSMT-LSI compared with those of the other models on the five datasets is shown in Table 8. HR10 represents the evaluation metric HR@10, and NDCG10 represents NDCG@10. The bolded data in Table 8 represent the best performance on the evaluation metrics, and the underlined data represent the second-best performance on the evaluation metrics.

Table 8 Performance of different models on five datasets

In Table 8, MSMT-LSI improves on TiSASRec [19] by 2.14% and 2.85% on HR@10 and NDCG@10, respectively, on MovieLens-1M; by 4.97% and 3.7% on Amazon Beauty; by 4.37% and 5.97% on Amazon Games; by 1.57% and 3.64% on MovieLens-10M; and by 2.31% and 3.06% on Gowalla. This is because, first, multiple forms of temporal embedding help extract a variety of patterns from user sequential behaviours, allowing full use of the temporal information. In addition, we use a multilayer long short-term self-attention network for sequence recommendation: it obtains the user's general long-term interests from long-term sequential behaviours and the user's current interests from short-term sequential behaviours. These two user representations are then combined through a fusion strategy so that each plays its role, and the attention weights on the items are calculated through the self-attention network to obtain accurate results.

Additionally, AttRec [15] is not as effective as TiSASRec [19] or our model. For example, on MovieLens-1M, MSMT-LSI obtains 0.8302 and 0.5987 on HR@10 and NDCG@10, respectively, TiSASRec [19] obtains 0.8088 and 0.5702, and AttRec [15] obtains 0.7655 and 0.5652. This may be because AttRec does not fully exploit timestamp or time interval information. TiSASRec [19] considers time interval information, so it outperforms SASRec [17] on all datasets, but it is not as effective as MSMT-LSI because it uses a single embedding.

The MovieLens-1M, MovieLens-10M and Gowalla datasets are dense and contain rich sequential behavioural and temporal information, and all the models are more effective on them. The two sparse datasets, Amazon Beauty and Amazon Games, contain less such information, and the models do not work as well on them, with the worst results on Amazon Beauty. TiSASRec [19] works better than SASRec [17] and AttRec [15] because it fully considers timestamp and time interval information; however, it does not combine short- and long-term interests, so its results are not as good as those of our proposed MSMT-LSI model. MSMT-LSI obtains better results than the other models on both dense and sparse datasets, suggesting that our use of temporal information and our modelling and combination of short- and long-term interests yield better results.

5.6 Ablation Experiments

To explore whether the multiple temporal embedding strategies of MSMT-LSI contribute to the results, we performed ablation experiments on the MovieLens-1M dataset with the other parameters fixed. We denote the model that uses only the embedding \({\textbf {E}}^{P}\) for each attention head and no long- and short-term interest combination as MSMT-LSI-EP (EP); the model that uses only \({\textbf {E}}^{D}\) as MSMT-LSI-ED (ED); the model that uses only \({\textbf {E}}^{S}\) as MSMT-LSI-ES (ES); the model that uses only \({\textbf {E}}^{E}\) as MSMT-LSI-EE (EE); the model that uses only \({\textbf {E}}^{L}\) as MSMT-LSI-EL (EL); representative hybrid models that combine pairs of embeddings as EP+ED (PD), ES+EE (SE), and ES+EL (SL); and the hybrid model using four embeddings as EP+ED+ES+EL (PDSL). The NDCG@10 performance of each model on the MovieLens-1M dataset is shown in Fig. 14.

Fig. 14 NDCG@10 performance of each ablation model on the MovieLens-1M dataset

As shown in Fig. 14, on the MovieLens-1M dataset, the ES, EE, and EL models, which use relative position encoding, are more effective than the EP and ED models, which use absolute position encoding, and contribute a larger gain to the final results; EL has the largest gain, with an NDCG@10 of 0.5744. The four-embedding hybrid model EP+ED+ES+EL (PDSL) achieves the best results among all the ablation models, with an NDCG@10 of 0.5891, demonstrating that multiple forms of temporal embedding can extract different patterns from the user's sequential behaviour and make fuller use of the temporal information, thus achieving better results.

To explore whether the combined long- and short-term interest strategy of MSMT-LSI contributes to the results, we conducted ablation experiments on the MovieLens-10M and Gowalla datasets. Since the long- and short-term interest fusion strategy is used both in the prediction layer and in the training loss function, we denote the model that uses only long sequences for both prediction and the loss function as MSMT-LSI-LLoss; the model that uses only short sequences for both as MSMT-LSI-SLoss; the model that uses only long sequences for prediction but a loss function fusing short and long sequences as MSMT-LSI-L; and the model that uses only short sequences for prediction but a loss function fusing short and long sequences as MSMT-LSI-S. The results of each model are shown in Figs. 15 and 16.

Fig. 15 HR@10 performance of the ablation models on the two datasets

Fig. 16 NDCG@10 performance of the ablation models on the two datasets

As shown in Figs. 15 and 16, the ablation models differ somewhat in HR@10 and NDCG@10 on the two datasets, with MSMT-LSI consistently obtaining the best results, illustrating that combining long-term and short-term interests has a positive effect on sequence recommendation. In most cases, the MSMT-LSI-L and MSMT-LSI-LLoss models work better than MSMT-LSI-S and MSMT-LSI-SLoss. This is because the latter models include only the user's recent short-term behaviours; they may ignore long-term preferences and thus cannot effectively capture the dynamic changes in the user's interest, whereas the former use longer sequences and therefore obtain fuller and more effective information, facilitating more accurate recommendation. For example, on the MovieLens-10M dataset, the HR@10 values of MSMT-LSI-L and MSMT-LSI-LLoss are 0.9277 and 0.9231, respectively, whereas those of MSMT-LSI-S and MSMT-LSI-SLoss are 0.9201 and 0.9106, respectively. Furthermore, in most cases, MSMT-LSI-L outperforms MSMT-LSI-LLoss, and MSMT-LSI-S outperforms MSMT-LSI-SLoss, which suggests that a loss function incorporating the fusion of long and short sequences brings a greater gain than one using long or short sequences alone. On the MovieLens-10M dataset, MSMT-LSI-L performs slightly worse than MSMT-LSI-S on NDCG@10, which may be related to the sequence characteristics of the dataset itself, where short-term interest matters more to the final result than long-term interest. In conclusion, our enhancements yield significant performance gains in the final output of the model.

6 Conclusion

With the rapid development of big data, recommendation systems that help people quickly and accurately obtain valuable information from massive amounts of data continue to proliferate. However, capturing rapid changes in users' interests remains difficult, and temporal information is not exploited deeply enough; in-depth study of sequence recommendation therefore has both research challenges and practical value. In this paper, a sequence recommendation model, MSMT-LSI, is proposed that focuses on balancing users' long- and short-term interests. Multiple forms of temporal embedding are used to extract various patterns from users' sequential behaviours, and users' long-term and short-term preferences are captured by two multihead self-attention networks, which are ultimately integrated into a hybrid representation for recommendation. We then find the most suitable parameter combinations for the model through parameter sensitivity analysis and verify the advantages of the long- and short-term fusion strategy. Finally, performance comparisons with different models and ablation experiments demonstrate that MSMT-LSI outperforms the classical comparison models on benchmark datasets. Future work will focus on finer-grained segmentation of user behaviour patterns and in-depth study of the cold start problem.