ABSTRACT
Complex system software supports a wide variety of execution conditions, spanning system configurations and workload properties. This paper explores a principled use of reference executions (executions under conditions similar to those of the target) to help identify the symptoms and causes of performance anomalies. First, to identify anomaly symptoms, we construct change profiles that probabilistically characterize the expected performance deviations between target and reference executions. By synthesizing several single-parameter change profiles, we can scalably identify anomalous reference-to-target changes in a complex system with multiple execution parameters. Second, to narrow the scope of root cause analysis, we filter for anomaly-related low-level system metrics: those that manifest very differently between target and reference executions. Our anomaly identification approach requires little expert knowledge and few detailed models of system internals, so it can be deployed easily. Using empirical case studies on the Linux I/O subsystem and a J2EE-based distributed online service, we demonstrate the approach's effectiveness in identifying performance anomalies over a wide range of execution conditions and across multiple system software versions. In particular, we discovered five previously unknown causes of performance anomalies in the Linux 2.6.23 kernel. Additionally, our preliminary results suggest that online anomaly detection and system reconfiguration may help evade performance anomalies in complex online systems.
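The change-profile idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not say how profiles are constructed, so the sketch assumes a profile is simply the empirical distribution of relative throughput changes observed when one execution parameter is varied. The function names, the throughput metric (higher is better), and the 0.05 threshold are all hypothetical.

```python
from bisect import bisect_right


def build_change_profile(pairs):
    """Build an empirical single-parameter change profile.

    pairs: iterable of (reference_throughput, target_throughput) measured on
    execution pairs that differ in one parameter (an assumption of this sketch).
    Returns the sorted list of relative changes (target - ref) / ref.
    """
    return sorted((target - ref) / ref for ref, target in pairs)


def degradation_p_value(profile, ref_throughput, target_throughput):
    """Fraction of profiled changes at least as bad as the observed change.

    A small value means the observed reference-to-target drop is rarely
    seen in the profile, which this sketch treats as an anomaly symptom.
    """
    observed = (target_throughput - ref_throughput) / ref_throughput
    # bisect_right counts the profiled changes that are <= the observed change.
    return bisect_right(profile, observed) / len(profile)


def is_anomalous(profile, ref_throughput, target_throughput, threshold=0.05):
    """Flag the change when its empirical p-value is at or below the threshold."""
    return degradation_p_value(profile, ref_throughput, target_throughput) <= threshold


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    samples = [(100, 100), (100, 90), (100, 110), (100, 95), (100, 105),
               (100, 80), (100, 102), (100, 98), (100, 101), (100, 99)]
    profile = build_change_profile(samples)
    print(is_anomalous(profile, 200, 100))  # a 50% drop, outside the profiled range
    print(is_anomalous(profile, 200, 200))  # no change
```

An empirical distribution keeps the sketch free of modeling assumptions about system internals, in the spirit of the approach described above. Combining several single-parameter profiles into a multi-parameter judgment is left out because the abstract does not describe that step.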
Reference-driven performance anomaly identification. SIGMETRICS '09.