Long-Term Forecasting using Transformers may not be the way to go
In recent years, Transformer-based solutions have been gaining incredible popularity. With the success of BERT, GPT, and other language transformers, researchers started to apply this architecture to other sequential-modeling problems, especially in the area of time series forecasting (also known as Long-Term Time Series Forecasting or LTSF). The attention mechanism seemed to be a perfect way to extract some of the long-term correlations present in long sequences.
However, researchers from the Chinese University of Hong Kong and the International Digital Economy Academy (IDEA) recently decided to ask: Are Transformers Effective for Time Series Forecasting? They show that self-attention mechanisms (even with positional encoding) can result in temporal information loss. They then validate this claim with a set of one-layer linear models which outperform the transformer benchmarks in almost every experiment.
In simpler terms, Transformers may not be the ideal architecture for forecasting problems.
In this post, I aim to summarize the findings and experiments of Zeng et al. that lead to this conclusion and discuss some potential implications of the work. All the experiments and models developed by the authors can be found in their GitHub repository as well. Additionally, I highly encourage everyone to read the original paper.
The Models and Data
In their work, the authors evaluated five different SOTA Transformer models on the Electricity Transformer Dataset (ETDataset). These models and some of their main features are as follows:
- LogTrans: Proposes convolutional self-attention so that local context can be better incorporated into the attention mechanism. The model also encodes a sparsity bias into the attention scheme, which helps reduce the memory complexity.
- Informer: Addresses the memory/time complexity and error accumulation issues caused by an auto-regressive decoder by proposing a new architecture and a direct multi-step (DMS) forecasting strategy.
- Autoformer: Applies a seasonal-trend decomposition behind each neural block to extract the trend-cyclical components. Additionally, Autoformer designs a series-wise auto-correlation mechanism to replace vanilla self-attention.
- Pyraformer: Implements a novel pyramidal attention mechanism that captures hierarchical multi-scale temporal dependencies. Like LogTrans, this model also explicitly encodes a sparsity bias into the attention scheme.
- FEDformer: Enhances the traditional transformer architecture by incorporating seasonal-trend decomposition methods into the architecture, effectively developing a Frequency-Enhanced Decomposed TransFormer.
Each of these models changes a different piece of the transformer architecture to address a different problem with traditional transformers (a full summary can be found in figure 1).
To compete against these transformer models, the authors proposed some “embarrassingly simple” models that perform DMS predictions.
These models and their properties are:
- Decomposed Linear (D-Linear): D-Linear uses a decomposition scheme to split the raw data into a trend component and a seasonal component. A separate single-layer linear network is then applied to each component, and the two outputs are summed to get the final prediction.
- Normalized Linear (N-Linear): N-Linear first subtracts the last value of the sequence from the input. The input is then passed through a single linear layer, and the subtracted value is added back before making the final prediction. This helps address distribution shifts in the data.
- Repeat: Simply repeat the last value in the look-back window.
These are very simple baselines. The Linear models both involve a small amount of data preprocessing and a single-layer network. Repeat is a trivial baseline.
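The three baselines above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the edge-padded moving average used for the trend, the kernel size of 25, and the randomly initialized (untrained) weights are all choices made here for demonstration. Each linear layer maps the whole look-back window to the whole forecast horizon in one shot, which is what makes the prediction direct multi-step (DMS).

```python
import numpy as np


def moving_average(x, kernel_size):
    """Trend estimate: moving average with edge padding so the length is preserved."""
    pad_front = (kernel_size - 1) // 2
    pad_back = kernel_size - 1 - pad_front
    padded = np.concatenate(
        [np.repeat(x[:1], pad_front), x, np.repeat(x[-1:], pad_back)]
    )
    kernel = np.ones(kernel_size) / kernel_size
    return np.convolve(padded, kernel, mode="valid")


def repeat_forecast(x, horizon):
    """Repeat: copy the last observed value across the forecast horizon."""
    return np.full(horizon, x[-1])


def n_linear_forecast(x, W, b):
    """N-Linear: subtract the last value, apply one linear layer, add it back."""
    last = x[-1]
    return W @ (x - last) + b + last


def d_linear_forecast(x, W_trend, b_trend, W_seas, b_seas, kernel_size=25):
    """D-Linear: one linear layer per component, outputs summed."""
    trend = moving_average(x, kernel_size)  # assumed decomposition scheme
    seasonal = x - trend
    return (W_trend @ trend + b_trend) + (W_seas @ seasonal + b_seas)


# Illustrative usage with untrained weights of shape (horizon, lookback).
rng = np.random.default_rng(0)
lookback, horizon = 96, 24
x = np.sin(np.arange(lookback) / 5.0) + 0.01 * np.arange(lookback)
W1 = rng.normal(scale=0.01, size=(horizon, lookback))
W2 = rng.normal(scale=0.01, size=(horizon, lookback))
b = np.zeros(horizon)

y_repeat = repeat_forecast(x, horizon)
y_nlinear = n_linear_forecast(x, W1, b)
y_dlinear = d_linear_forecast(x, W1, b, W2, b)
print(y_repeat.shape, y_nlinear.shape, y_dlinear.shape)
```

Note that with all-zero weights N-Linear reduces to the Repeat baseline, which is one way to see why subtracting and re-adding the last value helps when the level of the series shifts.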
The experiments were carried out on various widely-used datasets, including the Electricity Transformer (ETDataset), Traffic, Electricity, Weather, ILI, and Exchange Rate datasets.
On the eight models above, the authors performed a series of experiments to evaluate the models’ performance and determine the impact of various components of each model on the final predictions.
The first experiment was simple: each model was trained and used to forecast the data. The look-back periods were varied as well. The full testing results can be found in table 1, but in summary, FEDformer was the best-performing transformer in most cases yet was never the overall best performer.
This embarrassing performance of transformers can be seen in the predictions for the Electricity, Exchange-Rate, and ETDataset in figure 3.
Quoting the authors:
Transformers [28, 30, 31] fail to capture the scale and bias of the future data on Electricity and ETTh2. Moreover, they can hardly predict a proper trend on aperiodic data such as Exchange-Rate. These phenomena further indicate the inadequacy of existing Transformer-based solutions for the LTSF task.
Many would argue, however, that this is unfair to transformers: attention mechanisms are usually good at preserving long-range information, so Transformers should perform better with longer input sequences. The authors test this hypothesis in their next experiment. They vary the look-back period between 24 and 720 time steps and evaluate the MSE. The authors found that in many cases the performance of the transformers did not improve, and the error actually increased for several models (see figure 4 for full results). In comparison, the performance of the Linear models improved significantly with the inclusion of more time steps.
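To make the look-back sweep concrete, here is a toy version of the procedure. It is only a sketch: the synthetic series, the train/test split, the forecast horizon of 24, and the closed-form least-squares fit are all assumptions made for illustration, whereas the authors train on real datasets with look-back sizes up to 720. The helper names are hypothetical.

```python
import numpy as np


def make_windows(series, lookback, horizon):
    """Slice a 1-D series into (input, target) pairs for direct multi-step forecasting."""
    n = len(series) - lookback - horizon + 1
    X = np.stack([series[i:i + lookback] for i in range(n)])
    Y = np.stack([series[i + lookback:i + lookback + horizon] for i in range(n)])
    return X, Y


def fit_linear(X, Y):
    """Least-squares fit of a single linear layer (weights plus a bias column)."""
    X1 = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return coef


def mse(X, Y, coef):
    """Mean squared error of the fitted layer on (X, Y)."""
    X1 = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean((X1 @ coef - Y) ** 2))


# Synthetic series with a daily-like and a weekly-like cycle plus noise (assumed data).
rng = np.random.default_rng(0)
t = np.arange(3000)
series = (
    np.sin(2 * np.pi * t / 24)
    + 0.5 * np.sin(2 * np.pi * t / 168)
    + 0.1 * rng.normal(size=t.size)
)
train, test = series[:2400], series[2400:]
horizon = 24

results = {}
for lookback in (24, 48, 96, 192, 336):
    X_tr, Y_tr = make_windows(train, lookback, horizon)
    X_te, Y_te = make_windows(test, lookback, horizon)
    coef = fit_linear(X_tr, Y_tr)
    results[lookback] = mse(X_te, Y_te, coef)
    print(f"look-back {lookback:>3}: test MSE {results[lookback]:.4f}")
```

The same loop could wrap any forecaster; the question the authors ask is simply whether the test error falls as the look-back size grows.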
There are still other factors to consider, however. Due to their complexity, transformers often require larger training data sets than other models in order to perform well, so the authors decided to test whether training data size is a limiting factor for these transformer architectures. They took the Traffic data and trained Autoformer and FEDformer on the original set as well as on a truncated set, expecting the errors to be higher with the smaller training set. Surprisingly, the models trained on the smaller training set performed marginally better. While this does not mean that one should use a smaller training set, it does suggest that data set size is not the limiting factor for LTSF Transformers.
Along with varying the training data size and the look-back period size, the authors also experimented with varying which time steps the look-back window started at. For example, if they were trying to make a prediction for the period after t=196, instead of using t = 100, 101,…, 196 (the adjacent or “close” window) the authors tried using t = 4, 5,…, 100 (the “far” window). The idea is that forecasting should depend on whether the model can capture trend and periodicity well, and the farther the look-back window is from the period being forecast, the worse the prediction should be. The authors found that the performance of the transformers drops only slightly between the “close” and “far” windows. This implies that the transformers may be overfitting to the provided data, which could explain why the Linear models perform better.
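The indexing in the example above can be written out explicitly. This is a small sketch using the time indices from the example; the stand-in series (whose value equals its own time index), the forecast horizon of 24, and the helper name are assumptions for illustration.

```python
import numpy as np


def window(series, start, end_inclusive):
    """Return the time steps start..end_inclusive of a 1-D series."""
    return series[start:end_inclusive + 1]


# Stand-in series whose value equals its own time index, so slices are easy to read.
series = np.arange(300, dtype=float)

forecast_start = 197  # first step of the period after t = 196
horizon = 24          # hypothetical forecast length

close = window(series, 100, 196)  # adjacent ("close") look-back window
far = window(series, 4, 100)      # earlier ("far") look-back window
target = series[forecast_start:forecast_start + horizon]

# Both windows have the same length; only the gap to the target differs.
gap_close = int(target[0] - close[-1]) - 1
gap_far = int(target[0] - far[-1]) - 1
print(len(close), len(far), gap_close, gap_far)
```

Because both inputs have the same length, any difference in forecast error between them can be attributed to how recent the input is rather than to how much of it there is.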
After evaluating the various transformer models, the authors also looked specifically into the effectiveness of the self-attention and embedding strategies used by these models. Their first experiment involved disassembling existing transformers to analyze whether the complex design of the transformer was necessary. They replaced the attention layer with a simple linear layer, then removed auxiliary pieces other than the embedding mechanisms, and finally reduced the transformer down to only linear layers. At each step, they recorded the MSE using various look-back period sizes and found that the performance of the transformer improves with the gradual simplification.
The authors also wanted to examine the ability of the transformers to preserve temporal order. They hypothesized that since self-attention is permutation-invariant (ignores order) and time series are permutation-sensitive, positional encoding and self-attention may not be enough to capture temporal information. To test this, the authors modified the input sequences in two ways: shuffling the data, and exchanging the first half of the input sequence with the second half. The more temporal information a model captures, the more its performance should degrade on the modified inputs. The authors observed that the linear models had a larger performance drop than any of the transformer models, suggesting that the transformers are capturing less temporal information than the linear models. The full results can be found in the table below.
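The two input modifications are simple to express in code. The sketch below is illustrative only: the function names and the relative-drop helper are assumptions, and it shows the perturbations being applied to the look-back window alone, leaving the forecast target untouched.

```python
import numpy as np


def shuffle_input(x, rng):
    """Randomly permute the time steps of the look-back window."""
    return x[rng.permutation(len(x))]


def exchange_halves(x):
    """Swap the first half of the look-back window with the second half."""
    mid = len(x) // 2
    return np.concatenate([x[mid:], x[:mid]])


def relative_drop(mse_original, mse_perturbed):
    """Relative increase in error caused by perturbing the input."""
    return (mse_perturbed - mse_original) / mse_original


rng = np.random.default_rng(0)
x = np.arange(8)

shuffled = shuffle_input(x, rng)
swapped = exchange_halves(x)
print(swapped)  # [4 5 6 7 0 1 2 3]

# Hypothetical error values, only to show how the drop would be reported.
print(relative_drop(0.2, 0.3))
```

A model that genuinely relies on temporal order should show a large relative drop under both perturbations; a model whose error barely moves is, by this test, not using the order of its inputs very much.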
To dig further into the information-capturing capabilities of transformers, the authors examined the effectiveness of different encoding strategies by removing positional and temporal encoding from the transformers. The results were mixed depending on the model. For FEDformer and Autoformer, removing positional encoding improved performance on the Traffic dataset for most look-back window sizes. However, Informer performed best when it had all of its positional encodings.
Discussion and Conclusion
There are a few points to be careful of when interpreting these results. Transformers are very sensitive to hyperparameters and often require a lot of tuning to model a problem effectively. However, the authors do not perform any kind of hyperparameter search when implementing these models, instead opting to use the default parameters from each model's implementation. There is an argument to be made that they also did not tune the linear models, so the comparison is fair. Additionally, tuning the linear models would take significantly less time than training the transformers due to the simplicity of the linear models. Despite this, there could be problems where transformers work extremely well with the right hyperparameters, and where cost and time can be ignored for the sake of accuracy.
Despite these critiques, the experiments done by the authors give a clear breakdown of the flaws of transformers. These are large, very complex models that overfit easily on time series data. While they work well for language processing and other tasks, the permutation-invariant nature of self-attention does cause significant loss of temporal information. Additionally, a linear model is highly interpretable and explainable compared to the complicated architecture of a Transformer. If some modifications are made to these components of LTSF Transformers, we may see them eventually beat simple linear models or tackle problems linear models are bad at modeling (for example, change point identification). In the meantime, however, data scientists and decision-makers should not blindly throw Transformers at a time-series forecasting problem without having very good reasons for using this architecture.
Resources and References
A. Zeng, M. Chen, L. Zhang, Q. Xu. Are Transformers Effective for Time Series Forecasting? (2022). Thirty-Seventh AAAI Conference on Artificial Intelligence.
S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y. Wang, X. Yan. Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting (2019). Advances in Neural Information Processing Systems 32.
H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, W. Zhang. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting (2021). The Thirty-Fifth AAAI Conference on Artificial Intelligence, Virtual Conference.
H. Wu, J. Xu, J. Wang, M. Long. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting (2021). Advances in Neural Information Processing Systems 34.
S. Liu, H. Yu, C. Liao, J. Li, W. Lin, A.X. Liu, S. Dustdar. Pyraformer: Low-Complexity Pyramidal Attention for Long-Range Time Series Modeling and Forecasting (2021). International Conference on Learning Representations 2021.
T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, R. Jin. FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting (2022). 39th International Conference on Machine Learning.
G. Lai, W-C. Chang, Y. Yang, and H. Liu. Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks (2017). 41st International ACM SIGIR Conference on Research and Development in Information Retrieval.