ABSTRACT
Automated, rapid, and effective fault management is a central goal of large operational IP networks. Today's networks suffer from a wide and volatile set of failure modes, where the underlying fault proves difficult to detect and localize, thereby delaying repair. One of the main challenges stems from operational reality: IP routing and the underlying optical fiber plant are typically described by disparate data models and housed in distinct network management systems. We introduce a fault-localization methodology based on the use of risk models and an associated troubleshooting system, SCORE (Spatial Correlation Engine), which automatically identifies likely root causes across layers. In particular, we apply SCORE to the problem of localizing link failures in IP and optical networks. In experiments conducted on a tier-1 ISP backbone, SCORE proved remarkably effective at localizing optical link failures using only IP-layer event logs. Moreover, SCORE was often able to automatically uncover inconsistencies in the databases that maintain the critical associations between the IP and optical networks.
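The abstract does not spell out how a risk model is used to localize a fault, so the following is only a minimal sketch of one plausible reading, not SCORE's actual algorithm. It assumes a risk model is a mapping from each shared risk (for example a fiber span or a router) to the set of IP links that fail when that risk fails, and it uses a greedy set-cover heuristic to pick a small set of risks that explains the observed IP link failures. The function name `localize`, the identifiers in the example data, and the choice of greedy set cover are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def localize(risk_groups: Dict[str, Set[str]], failed_links: Set[str]) -> List[str]:
    """Greedily choose risk groups that explain the observed failed IP links.

    risk_groups: hypothetical risk model, mapping a shared risk (e.g. a fiber)
                 to the IP links that depend on it.
    failed_links: IP links reported down in the event logs.
    """
    unexplained = set(failed_links)
    hypothesis: List[str] = []
    while unexplained and risk_groups:
        # Pick the group covering the most still-unexplained failures.
        # Iterating in sorted order makes ties resolve alphabetically.
        best = max(sorted(risk_groups),
                   key=lambda g: len(risk_groups[g] & unexplained))
        covered = risk_groups[best] & unexplained
        if not covered:
            # No group in the model explains the remaining failures.
            break
        hypothesis.append(best)
        unexplained -= covered
    return hypothesis


# Illustrative data only: three risks and the IP links that ride on them.
example_model = {
    "fiber_A": {"l1", "l2", "l3"},
    "fiber_B": {"l3", "l4"},
    "router_X": {"l1"},
}
print(localize(example_model, {"l1", "l2", "l3"}))
```

In this sketch, failures left unexplained at the end would be a hint that the risk model itself is incomplete or inconsistent, which is the kind of database discrepancy the abstract says SCORE was often able to surface.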
Index Terms
- IP fault localization via risk modeling