Real-Time Workload Pattern Analysis for Large-Scale Cloud Databases

Authors:
Jiaqi Wang, Zhejiang University
Tianyi Li, Aalborg University
Anni Wang, Alibaba Group
Xiaoze Liu, Purdue University
Lu Chen, Zhejiang University
Jie Chen, Alibaba Group
Jianye Liu, Alibaba Group
Junyang Wu, Zhejiang University
Feifei Li, Alibaba Group
Yunjun Gao, Zhejiang University

Proceedings of the VLDB Endowment, Volume 16, Issue 12, pp. 3689–3701. https://doi.org/10.14778/3611540.3611557

Published: 01 August 2023

Abstract

Hosting database services on cloud systems has become common practice. This has led to an increasing volume of database workloads, which provides an opportunity for pattern analysis. Discovering workload patterns from a business-logic perspective helps in understanding the trends and characteristics of a database system. However, existing workload pattern discovery systems are not suitable for the large-scale cloud databases commonly employed in industry, because the workload patterns of such databases are generally far more complicated than those of ordinary databases.

In this paper, we propose Alibaba Workload Miner (AWM), a real-time system for discovering workload patterns in complicated large-scale workloads. AWM encodes and discovers the SQL query patterns logged from user requests and optimizes query processing based on the discovered patterns. First, the Data Collection & Preprocessing Module collects streaming query logs and encodes them into high-dimensional feature embeddings with rich semantic contexts and execution features. Next, the Online Workload Mining Module separates the encoded queries by business group and discovers the workload patterns for each group. Meanwhile, the Offline Training Module collects labels and uses them to train the classification model. Finally, the Pattern-based Optimizing Module optimizes query processing in cloud databases by exploiting the discovered patterns. Extensive experimental results on one synthetic dataset and two real-life datasets (extracted from Alibaba Cloud databases) show that AWM enhances the accuracy of pattern discovery by 66% and reduces the latency of online inference by 22%, compared with state-of-the-art methods.
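To make the flow of "encode queries, separate them by business group, discover patterns per group" concrete, the following is a minimal toy sketch and not AWM itself. It swaps the learned feature embeddings and classification model described above for a much simpler stand-in: literal-stripped SQL templates and frequency counting. The names `templatize`, `mine_patterns`, and `min_count`, the group labels, and the sample log are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter, defaultdict


def templatize(sql: str) -> str:
    """Stand-in for query encoding: strip literals so that queries
    differing only in constants map to the same template."""
    sql = re.sub(r"'[^']*'", "?", sql)            # string literals
    sql = re.sub(r"\b\d+(\.\d+)?\b", "?", sql)    # numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()


def mine_patterns(log, min_count=2):
    """Stand-in for per-group pattern discovery.

    log: iterable of (group, sql) pairs; the group label is assumed
    to be given here, whereas a real system would have to infer it.
    Returns {group: [(template, count), ...]} with frequent templates only.
    """
    counts = defaultdict(Counter)
    for group, sql in log:
        counts[group][templatize(sql)] += 1
    return {
        group: [(t, n) for t, n in counter.most_common() if n >= min_count]
        for group, counter in counts.items()
    }


# Hypothetical query log, already tagged with a business group.
log = [
    ("orders", "SELECT * FROM orders WHERE id = 17"),
    ("orders", "SELECT * FROM orders WHERE id = 42"),
    ("orders", "UPDATE orders SET status = 'paid' WHERE id = 42"),
    ("users", "SELECT name FROM users WHERE email = 'a@example.com'"),
    ("users", "SELECT name FROM users WHERE email = 'b@example.com'"),
]
patterns = mine_patterns(log)
```

In this sketch the two `orders` lookups collapse into one template that counts as a pattern, while the single `UPDATE` falls below `min_count` and is dropped. A system like the one described in the abstract would replace both stand-in steps with learned components and run them over a stream rather than a list.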

Published in
Proceedings of the VLDB Endowment, Volume 16, Issue 12
August 2023
685 pages
ISSN: 2150-8097
Editors: Georgia Koutrika (Athena Research Center), Jun Yang (Duke University)
Publisher: VLDB Endowment