DOI: 10.5555/1251203.1251208

IP fault localization via risk modeling

Published: 02 May 2005

ABSTRACT

Automated, rapid, and effective fault management is a central goal of large operational IP networks. Today's networks suffer from a wide and volatile set of failure modes, where the underlying fault proves difficult to detect and localize, thereby delaying repair. One of the main challenges stems from operational reality: IP routing and the underlying optical fiber plant are typically described by disparate data models and housed in distinct network management systems. We introduce a fault-localization methodology based on the use of risk models and an associated troubleshooting system, SCORE (Spatial Correlation Engine), which automatically identifies likely root causes across layers. In particular, we apply SCORE to the problem of localizing link failures in IP and optical networks. In experiments conducted on a tier-1 ISP backbone, SCORE proved remarkably effective at localizing optical link failures using only IP-layer event logs. Moreover, SCORE was often able to automatically uncover inconsistencies in the databases that maintain the critical associations between the IP and optical networks.
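
The abstract does not spell out how SCORE uses its risk models, so the following is only an illustrative sketch of the general idea of spatial correlation over shared risk groups, not the authors' algorithm. It assumes a hypothetical mapping from each shared risk (for example a fiber span or a router port) to the IP links that depend on it, and greedily picks the risks that best explain a set of failed links. The function name `localize`, the group names, and the link names are all invented for illustration.

```python
from typing import Dict, List, Set


def localize(risk_groups: Dict[str, Set[str]], failed_links: Set[str]) -> List[str]:
    """Greedily choose risk groups that explain the observed failed links.

    Assumption (not from the paper): a group is a candidate only if every
    link that depends on it has failed.
    """
    unexplained = set(failed_links)
    hypothesis: List[str] = []
    while unexplained:
        # Map each candidate group to the still-unexplained links it covers.
        candidates = {
            name: links & unexplained
            for name, links in risk_groups.items()
            if links and links <= failed_links and links & unexplained
        }
        if not candidates:
            break  # the remaining failures match no modeled risk
        # Pick the group covering the most unexplained links; ties go to
        # the alphabetically first name so the result is deterministic.
        best = max(sorted(candidates), key=lambda n: len(candidates[n]))
        hypothesis.append(best)
        unexplained -= candidates[best]
    return hypothesis


# Hypothetical risk model: which IP links share which underlying component.
RISK_GROUPS = {
    "fiber_span_A": {"link1", "link2", "link3"},
    "fiber_span_B": {"link3", "link4"},
    "router_port_X": {"link1"},
}

print(localize(RISK_GROUPS, {"link1", "link2", "link3"}))
```

In this toy model, three IP links failing together are explained by the single fiber span they share rather than by three independent faults. Failures that no modeled group explains are left over, which is one way a mismatch between the IP and optical databases could surface.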

Published in

NSDI'05: Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation - Volume 2
May 2005
356 pages

Publisher: USENIX Association, United States
