Search | arXiv e-print repository

Mutiny! How does Kubernetes fail, and what can we do about it?

Authors: Marco Barletta, Marcello Cinque, Catello Di Martino, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer

Abstract: In this paper, we i) analyze and classify real-world failures of Kubernetes (the most popular container orchestration system), ii) develop a framework to perform a fault/error injection campaign targeting the data store preserving the cluster state, and iii) compare results of our fault/error injection experiments with real-world failures, showing that our fault/error injections can recreate many… ▽ More In this paper, we i) analyze and classify real-world failures of Kubernetes (the most popular container orchestration system), ii) develop a framework to perform a fault/error injection campaign targeting the data store preserving the cluster state, and iii) compare results of our fault/error injection experiments with real-world failures, showing that our fault/error injections can recreate many real-world failure patterns. The paper aims to address the lack of studies on systematic analyses of Kubernetes failures to date. Our results show that even a single fault/error (e.g., a bit-flip) in the data stored can propagate, causing cluster-wide failures (3% of injections), service networking issues (4%), and service under/overprovisioning (24%). Errors in the fields tracking dependencies between object caused 51% of such cluster-wide failures. We argue that controlled fault/error injection-based testing should be employed to proactively assess Kubernetes' resiliency and guide the design of failure mitigation strategies. △ Less

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.08509 [pdf, other]

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

Authors: Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Başar, Ravishankar K. Iyer

Abstract: Large language models (LLMs) have been driving a new wave of interactive AI applications across numerous domains. However, efficiently serving LLM inference requests is challenging due to their unpredictable execution times originating from the autoregressive nature of generative models. Existing LLM serving systems exploit first-come-first-serve (FCFS) scheduling, suffering from head-of-line bloc… ▽ More Large language models (LLMs) have been driving a new wave of interactive AI applications across numerous domains. However, efficiently serving LLM inference requests is challenging due to their unpredictable execution times originating from the autoregressive nature of generative models. Existing LLM serving systems exploit first-come-first-serve (FCFS) scheduling, suffering from head-of-line blocking issues. To address the non-deterministic nature of LLMs and enable efficient interactive LLM serving, we present a speculative shortest-job-first (SSJF) scheduler that uses a light proxy model to predict LLM output sequence lengths. Our open-source SSJF implementation does not require changes to memory management or batching strategies. Evaluations on real-world datasets and production workload traces show that SSJF reduces average job completion times by 30.5-39.6% and increases throughput by 2.2-3.6x compared to FCFS schedulers, across no batching, dynamic batching, and continuous batching settings. △ Less

Submitted 12 April, 2024; originally announced April 2024.

Comments: Accepted at AIOps'24

arXiv:2404.00869 [pdf, other]

Towards Automated Generation of Smart Grid Cyber Range for Cybersecurity Experiments and Training

Authors: Daisuke Mashima, Muhammad M. Roomi, Bennet Ng, Zbigniew Kalbarczyk, S. M. Suhail Hussain, Ee-chien Chang

Abstract: Assurance of cybersecurity is crucial to ensure dependability and resilience of smart power grid systems. In order to evaluate the impact of potential cyber attacks, to assess deployability and effectiveness of cybersecurity measures, and to enable hands-on exercise and training of personals, an interactive, virtual environment that emulates the behaviour of a smart grid system, namely smart grid… ▽ More Assurance of cybersecurity is crucial to ensure dependability and resilience of smart power grid systems. In order to evaluate the impact of potential cyber attacks, to assess deployability and effectiveness of cybersecurity measures, and to enable hands-on exercise and training of personals, an interactive, virtual environment that emulates the behaviour of a smart grid system, namely smart grid cyber range, has been demanded by industry players as well as academia. A smart grid cyber range is typically implemented as a combination of cyber system emulation, which allows interactivity, and physical system (i.e., power grid) simulation that are tightly coupled for consistent cyber and physical behaviours. However, its design and implementation require intensive expertise and efforts in cyber and physical aspects of smart power systems as well as software/system engineering. While many industry players, including power grid operators, device vendors, research and education sectors are interested, availability of the smart grid cyber range is limited to a small number of research labs. To address this challenge, we have developed a framework for modelling a smart grid cyber range using an XML-based language, called SG-ML, and for "compiling" the model into an operational cyber range with minimal engineering efforts. The modelling language includes standardized schema from IEC 61850 and IEC 61131, which allows industry players to utilize their existing configurations. The SG-ML framework aims at making a smart grid cyber range available to broader user bases to facilitate cybersecurity R\&D and hands-on exercises. △ Less

Submitted 31 March, 2024; originally announced April 2024.

Comments: Published at DSN 2023 Industry Track

arXiv:2403.07890 [pdf, other]

$\widetilde{O}(T^{-1})$ Convergence to (Coarse) Correlated Equilibria in Full-Information General-Sum Markov Games

Authors: Weichao Mao, Haoran Qiu, Chen Wang, Hubertus Franke, Zbigniew Kalbarczyk, Tamer Başar

Abstract: No-regret learning has a long history of being closely connected to game theory. Recent works have devised uncoupled no-regret learning dynamics that, when adopted by all the players in normal-form games, converge to various equilibrium solutions at a near-optimal rate of $\widetilde{O}(T^{-1})$, a significant improvement over the $O(1/\sqrt{T})$ rate of classic no-regret learners. However, analog… ▽ More No-regret learning has a long history of being closely connected to game theory. Recent works have devised uncoupled no-regret learning dynamics that, when adopted by all the players in normal-form games, converge to various equilibrium solutions at a near-optimal rate of $\widetilde{O}(T^{-1})$, a significant improvement over the $O(1/\sqrt{T})$ rate of classic no-regret learners. However, analogous convergence results are scarce in Markov games, a more generic setting that lays the foundation for multi-agent reinforcement learning. In this work, we close this gap by showing that the optimistic-follow-the-regularized-leader (OFTRL) algorithm, together with appropriate value update procedures, can find $\widetilde{O}(T^{-1})$-approximate (coarse) correlated equilibria in full-information general-sum Markov games within $T$ iterations. Numerical results are also included to corroborate our theoretical findings. △ Less

Submitted 23 April, 2024; v1 submitted 2 February, 2024; originally announced March 2024.

arXiv:2403.05448 [pdf, other]

On Practicality of Using ARM TrustZone Trusted Execution Environment for Securing Programmable Logic Controllers

Authors: Zhiang Li, Daisuke Mashima, Wen Shei Ong, Ertem Esiner, Zbigniew Kalbarczyk, Ee-Chien Chang

Abstract: Programmable logic controllers (PLCs) are crucial devices for implementing automated control in various industrial control systems (ICS), such as smart power grids, water treatment systems, manufacturing, and transportation systems. Owing to their importance, PLCs are often the target of cyber attackers that are aiming at disrupting the operation of ICS, including the nation's critical infrastruct… ▽ More Programmable logic controllers (PLCs) are crucial devices for implementing automated control in various industrial control systems (ICS), such as smart power grids, water treatment systems, manufacturing, and transportation systems. Owing to their importance, PLCs are often the target of cyber attackers that are aiming at disrupting the operation of ICS, including the nation's critical infrastructure, by compromising the integrity of control logic execution. While a wide range of cybersecurity solutions for ICS have been proposed, they cannot counter strong adversaries with a foothold on the PLC devices, which could manipulate memory, I/O interface, or PLC logic itself. These days, many ICS devices in the market, including PLCs, run on ARM-based processors, and there is a promising security technology called ARM TrustZone, to offer a Trusted Execution Environment (TEE) on embedded devices. Envisioning that such a hardware-assisted security feature becomes available for ICS devices in the near future, this paper investigates the application of the ARM TrustZone TEE technology for enhancing the security of PLC. Our aim is to evaluate the feasibility and practicality of the TEE-based PLCs through the proof-of-concept design and implementation using open-source software such as OP-TEE and OpenPLC. Our evaluation assesses the performance and resource consumption in real-world ICS configurations, and based on the results, we discuss bottlenecks in the OP-TEE secure OS towards a large-scale ICS and desired changes for its application on ICS devices. Our implementation is made available to public for further study and research. △ Less

Submitted 8 March, 2024; originally announced March 2024.

Comments: To appear at ACM AsiaCCS 2024

arXiv:2206.00886 [pdf, other]

Watch Out for the Safety-Threatening Actors: Proactively Mitigating Safety Hazards

Authors: Saurabh Jha, Shengkun Cui, Zbigniew Kalbarczyk, Ravishankar K. Iyer

Abstract: Despite the successful demonstration of autonomous vehicles (AVs), such as self-driving cars, ensuring AV safety remains a challenging task. Although some actors influence an AV's driving decisions more than others, current approaches pay equal attention to each actor on the road. An actor's influence on the AV's decision can be characterized in terms of its ability to decrease the number of safe… ▽ More Despite the successful demonstration of autonomous vehicles (AVs), such as self-driving cars, ensuring AV safety remains a challenging task. Although some actors influence an AV's driving decisions more than others, current approaches pay equal attention to each actor on the road. An actor's influence on the AV's decision can be characterized in terms of its ability to decrease the number of safe navigational choices for the AV. In this work, we propose a safety threat indicator (STI) using counterfactual reasoning to estimate the importance of each actor on the road with respect to its influence on the AV's safety. We use this indicator to (i) characterize the existing real-world datasets to identify rare hazardous scenarios as well as the poor performance of existing controllers in such scenarios; and (ii) design an RL based safety mitigation controller to proactively mitigate the safety hazards those actors pose to the AV. Our approach reduces the accident rate for the state-of-the-art AV agent(s) in rare hazardous scenarios by more than 70%. △ Less

Submitted 2 June, 2022; originally announced June 2022.

arXiv:2110.09998 [pdf, other]

Watch out for the risky actors: Assessing risk in dynamic environments for safe driving

Authors: Saurabh Jha, Yan Miao, Zbigniew Kalbarczyk, Ravishankar K. Iyer

Abstract: Driving in a dynamic environment that consists of other actors is inherently a risky task as each actor influences the driving decision and may significantly limit the number of choices in terms of navigation and safety plan. The risk encountered by the Ego actor depends on the driving scenario and the uncertainty associated with predicting the future trajectories of the other actors in the drivin… ▽ More Driving in a dynamic environment that consists of other actors is inherently a risky task as each actor influences the driving decision and may significantly limit the number of choices in terms of navigation and safety plan. The risk encountered by the Ego actor depends on the driving scenario and the uncertainty associated with predicting the future trajectories of the other actors in the driving scenario. However, not all objects pose a similar risk. Depending on the object's type, trajectory, position, and the associated uncertainty with these quantities; some objects pose a much higher risk than others. The higher the risk associated with an actor, the more attention must be directed towards that actor in terms of resources and safety planning. In this paper, we propose a novel risk metric to calculate the importance of each actor in the world and demonstrate its usefulness through a case study. △ Less

Submitted 19 October, 2021; originally announced October 2021.

Comments: preprint version

arXiv:2109.11666 [pdf, other]

SLO beyond the Hardware Isolation Limits

Authors: Haoran Qiu, Yongzhou Chen, Tianyin Xu, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer

Abstract: Performance isolation is a keystone for SLO guarantees with shared resources in cloud and datacenter environments. To meet SLO requirements, the state of the art relies on hardware QoS support (e.g., Intel RDT) to allocate shared resources such as last-level caches and memory bandwidth for co-located latency-critical applications. As a result, the number of latency-critical applications that can b… ▽ More Performance isolation is a keystone for SLO guarantees with shared resources in cloud and datacenter environments. To meet SLO requirements, the state of the art relies on hardware QoS support (e.g., Intel RDT) to allocate shared resources such as last-level caches and memory bandwidth for co-located latency-critical applications. As a result, the number of latency-critical applications that can be deployed on a physical machine is bounded by the hardware allocation capability. Unfortunately, such hardware capability is very limited. For example, Intel Xeon E5 v3 processors support at most four partitions for last-level caches, i.e., at most four applications can have dedicated resource allocation. This paper discusses the feasibility and unexplored challenges of providing SLO guarantees beyond the limits of hardware capability. We present CoCo to show the feasibility and the benefits. CoCo schedules applications to time-share interference-free partitions as a transparent software layer. Our evaluation shows that CoCo outperforms non-partitioned and round-robin approaches by up to 9x and 1.2x. △ Less

Submitted 23 September, 2021; originally announced September 2021.

arXiv:2102.10837 [pdf, other]

doi 10.1145/3445814.3446739

BayesPerf: Minimizing Performance Monitoring Errors Using Bayesian Statistics

Authors: Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer

Abstract: Hardware performance counters (HPCs) that measure low-level architectural and microarchitectural events provide dynamic contextual information about the state of the system. However, HPC measurements are error-prone due to non determinism (e.g., undercounting due to event multiplexing, or OS interrupt-handling behaviors). In this paper, we present BayesPerf, a system for quantifying uncertainty in… ▽ More Hardware performance counters (HPCs) that measure low-level architectural and microarchitectural events provide dynamic contextual information about the state of the system. However, HPC measurements are error-prone due to non determinism (e.g., undercounting due to event multiplexing, or OS interrupt-handling behaviors). In this paper, we present BayesPerf, a system for quantifying uncertainty in HPC measurements by using a domain-driven Bayesian model that captures microarchitectural relationships between HPCs to jointly infer their values as probability distributions. We provide the design and implementation of an accelerator that allows for low-latency and low-power inference of the BayesPerf model for x86 and ppc64 CPUs. BayesPerf reduces the average error in HPC measurements from 40.1% to 7.6% when events are being multiplexed. The value of BayesPerf in real-time decision-making is illustrated with a simple example of scheduling of PCIe transfers. △ Less

Submitted 22 February, 2021; originally announced February 2021.

Journal ref: Proceedings of the Twenty-Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 21), 2021

arXiv:2012.07755 [pdf, other]

Application-aware Congestion Mitigation for High-Performance Computing Systems

Authors: Archit Patke, Saurabh Jha, Haoran Qiu, Jim Brandt, Ann Gentile, Joe Greenseid, Zbigniew Kalbarczyk, Ravishankar Iyer

Abstract: High-performance computing (HPC) systems frequently experience congestion leading to significant application performance variation. However, the impact of congestion on application runtime differs from application to application depending on their network characteristics (such as bandwidth and latency requirements). We leverage this insight to develop Netscope, an automated ML-driven framework tha… ▽ More High-performance computing (HPC) systems frequently experience congestion leading to significant application performance variation. However, the impact of congestion on application runtime differs from application to application depending on their network characteristics (such as bandwidth and latency requirements). We leverage this insight to develop Netscope, an automated ML-driven framework that considers those network characteristics to dynamically mitigate congestion. We evaluate Netscope on four Cray Aries systems, including a production supercomputer on real scientific applications. Netscope has a lower training cost and accurately estimates the impact of congestion on application runtime with a correlation between 0.7and 0.9 for common scientific applications. Moreover, we find that Netscope reduces tail runtime variability by up to 14.9 times while improving median system utility by 12%. △ Less

Submitted 3 February, 2021; v1 submitted 14 December, 2020; originally announced December 2020.

arXiv:2008.08509 [pdf, other]

FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices

Authors: Haoran Qiu, Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer

Abstract: Modern user-facing latency-sensitive web services include numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing of compute resources across microservices is still challenging in production because contention for shared resources can cause latency spikes that violate the service-level objectives (SLOs) of user reque… ▽ More Modern user-facing latency-sensitive web services include numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing of compute resources across microservices is still challenging in production because contention for shared resources can cause latency spikes that violate the service-level objectives (SLOs) of user requests. This paper presents FIRM, an intelligent fine-grained resource management framework for predictable sharing of resources across microservices to drive up overall utilization. FIRM leverages online telemetry data and machine-learning methods to adaptively (a) detect/localize microservices that cause SLO violations, (b) identify low-level resources in contention, and (c) take actions to mitigate SLO violations via dynamic reprovisioning. Experiments across four microservice benchmarks demonstrate that FIRM reduces SLO violations by up to 16x while reducing the overall requested CPU limit by up to 62%. Moreover, FIRM improves performance predictability by reducing tail latencies by up to 11x. △ Less

Submitted 19 October, 2020; v1 submitted 19 August, 2020; originally announced August 2020.

Comments: This paper was accepted in OSDI '20

arXiv:2004.13004 [pdf, other]

ML-driven Malware that Targets AV Safety

Authors: Saurabh Jha, Shengkun Cui, Subho S. Banerjee, Timothy Tsai, Zbigniew Kalbarczyk, Ravi Iyer

Abstract: Ensuring the safety of autonomous vehicles (AVs) is critical for their mass deployment and public adoption. However, security attacks that violate safety constraints and cause accidents are a significant deterrent to achieving public trust in AVs, and that hinders a vendor's ability to deploy AVs. Creating a security hazard that results in a severe safety compromise (for example, an accident) is c… ▽ More Ensuring the safety of autonomous vehicles (AVs) is critical for their mass deployment and public adoption. However, security attacks that violate safety constraints and cause accidents are a significant deterrent to achieving public trust in AVs, and that hinders a vendor's ability to deploy AVs. Creating a security hazard that results in a severe safety compromise (for example, an accident) is compelling from an attacker's perspective. In this paper, we introduce an attack model, a method to deploy the attack in the form of smart malware, and an experimental evaluation of its impact on production-grade autonomous driving software. We find that determining the time interval during which to launch the attack is{ critically} important for causing safety hazards (such as collisions) with a high degree of success. For example, the smart malware caused 33X more forced emergency braking than random attacks did, and accidents in 52.6% of the driving simulations. △ Less

Submitted 12 June, 2020; v1 submitted 24 April, 2020; originally announced April 2020.

Comments: Accepted for DSN 2020

Journal ref: 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks

arXiv:1909.02119 [pdf, other]

Inductive-bias-driven Reinforcement Learning For Efficient Schedules in Heterogeneous Clusters

Authors: Subho S Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer

Abstract: The problem of scheduling of workloads onto heterogeneous processors (e.g., CPUs, GPUs, FPGAs) is of fundamental importance in modern data centers. Current system schedulers rely on application/system-specific heuristics that have to be built on a case-by-case basis. Recent work has demonstrated ML techniques for automating the heuristic search by using black-box approaches which require significa… ▽ More The problem of scheduling of workloads onto heterogeneous processors (e.g., CPUs, GPUs, FPGAs) is of fundamental importance in modern data centers. Current system schedulers rely on application/system-specific heuristics that have to be built on a case-by-case basis. Recent work has demonstrated ML techniques for automating the heuristic search by using black-box approaches which require significant training data and time, which make them challenging to use in practice. This paper presents Symphony, a scheduling framework that addresses the challenge in two ways: (i) a domain-driven Bayesian reinforcement learning (RL) model for scheduling, which inherently models the resource dependencies identified from the system architecture; and (ii) a sampling-based technique to compute the gradients of a Bayesian model without performing full probabilistic inference. Together, these techniques reduce both the amount of training data and the time required to produce scheduling policies that significantly outperform black-box approaches by up to 2.2x. △ Less

Submitted 30 June, 2020; v1 submitted 4 September, 2019; originally announced September 2019.

Comments: Scheduling, Bayesian, POMDP, Sampling, Deep Reinforcement Learning, Accelerators, FPGA, GPU

Journal ref: Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020

arXiv:1907.10203 [pdf, other]

Live Forensics for Distributed Storage Systems

Authors: Saurabh Jha, Shengkun Cui, Tianyin Xu, Jeremy Enos, Mike Showerman, Mark Dalton, Zbigniew T. Kalbarczyk, William T. Kramer, Ravishankar K. Iyer

Abstract: We present Kaleidoscope an innovative system that supports live forensics for application performance problems caused by either individual component failures or resource contention issues in large-scale distributed storage systems. The design of Kaleidoscope is driven by our study of I/O failures observed in a peta-scale storage system anonymized as PetaStore. Kaleidoscope is built on three key fe… ▽ More We present Kaleidoscope an innovative system that supports live forensics for application performance problems caused by either individual component failures or resource contention issues in large-scale distributed storage systems. The design of Kaleidoscope is driven by our study of I/O failures observed in a peta-scale storage system anonymized as PetaStore. Kaleidoscope is built on three key features: 1) using temporal and spatial differential observability for end-to-end performance monitoring of I/O requests, 2) modeling the health of storage components as a stochastic process using domain-guided functions that accounts for path redundancy and uncertainty in measurements, and, 3) observing differences in reliability and performance metrics between similar types of healthy and unhealthy components to attribute the most likely root causes. We deployed Kaleidoscope on PetaStore and our evaluation shows that Kaleidoscope can run live forensics at 5-minute intervals and pinpoint the root causes of 95.8% of real-world performance issues, with negligible monitoring overhead. △ Less

Submitted 23 July, 2019; originally announced July 2019.

arXiv:1907.05312 [pdf, other]

A Study of Network Congestion in Two Supercomputing High-Speed Interconnects

Authors: Saurabh Jha, Archit Patke, Jim Brandt, Ann Gentile, Mike Showerman, Eric Roman, Zbigniew T. Kalbarczyk, William T. Kramer, Ravishankar K. Iyer

Abstract: Network congestion in high-speed interconnects is a major source of application run time performance variation. Recent years have witnessed a surge of interest from both academia and industry in the development of novel approaches for congestion control at the network level and in application placement, mapping, and scheduling at the system-level. However, these studies are based on proxy applicat… ▽ More Network congestion in high-speed interconnects is a major source of application run time performance variation. Recent years have witnessed a surge of interest from both academia and industry in the development of novel approaches for congestion control at the network level and in application placement, mapping, and scheduling at the system-level. However, these studies are based on proxy applications and benchmarks that are not representative of field-congestion characteristics of high-speed interconnects. To address this gap, we present (a) an end-to-end framework for monitoring and analysis to support long-term field-congestion characterization studies, and (b) an empirical study of network congestion in petascale systems across two different interconnect technologies: (i) Cray Gemini, which uses a 3-D torus topology, and (ii) Cray Aries, which uses the DragonFly topology. △ Less

Submitted 11 July, 2019; originally announced July 2019.

Comments: Accepted for HOTI2019

arXiv:1907.01051 [pdf, other]

ML-based Fault Injection for Autonomous Vehicles: A Case for Bayesian Fault Injection

Authors: Saurabh Jha, Subho S. Banerjee, Timothy Tsai, Siva K. S. Hari, Michael B. Sullivan, Zbigniew T. Kalbarczyk, Stephen W. Keckler, Ravishankar K. Iyer

Abstract: The safety and resilience of fully autonomous vehicles (AVs) are of significant concern, as exemplified by several headline-making accidents. While AV development today involves verification, validation, and testing, end-to-end assessment of AV systems under accidental faults in realistic driving scenarios has been largely unexplored. This paper presents DriveFI, a machine learning-based fault inj… ▽ More The safety and resilience of fully autonomous vehicles (AVs) are of significant concern, as exemplified by several headline-making accidents. While AV development today involves verification, validation, and testing, end-to-end assessment of AV systems under accidental faults in realistic driving scenarios has been largely unexplored. This paper presents DriveFI, a machine learning-based fault injection engine, which can mine situations and faults that maximally impact AV safety, as demonstrated on two industry-grade AV technology stacks (from NVIDIA and Baidu). For example, DriveFI found 561 safety-critical faults in less than 4 hours. In comparison, random injection experiments executed over several weeks could not find any safety-critical faults △ Less

Submitted 1 July, 2019; originally announced July 2019.

Comments: Accepted at 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks

arXiv:1907.01024 [pdf, other]

Kayotee: A Fault Injection-based System to Assess the Safety and Reliability of Autonomous Vehicles to Faults and Errors

Authors: Saurabh Jha, Timothy Tsai, Siva Hari, Michael Sullivan, Zbigniew Kalbarczyk, Stephen W. Keckler, Ravishankar K. Iyer

Abstract: Fully autonomous vehicles (AVs), i.e., AVs with autonomy level 5, are expected to dominate road transportation in the near-future and contribute trillions of dollars to the global economy. The general public, government organizations, and manufacturers all have significant concern regarding resiliency and safety standards of the autonomous driving system (ADS) of AVs . In this work, we proposed an… ▽ More Fully autonomous vehicles (AVs), i.e., AVs with autonomy level 5, are expected to dominate road transportation in the near-future and contribute trillions of dollars to the global economy. The general public, government organizations, and manufacturers all have significant concern regarding resiliency and safety standards of the autonomous driving system (ADS) of AVs . In this work, we proposed and developed (a) `Kayotee' - a fault injection-based tool to systematically inject faults into software and hardware components of the ADS to assess the safety and reliability of AVs to faults and errors, and (b) an ontology model to characterize errors and safety violations impacting reliability and safety of AVs. Kayotee is capable of characterizing fault propagation and resiliency at different levels - (a) hardware, (b) software, (c) vehicle dynamics, and (d) traffic resilience. We used Kayotee to study a proprietary ADS technology built by Nvidia corporation and are currently applying Kayotee to other open-source ADS systems. △ Less

Submitted 1 July, 2019; originally announced July 2019.

Comments: Presented at Automotive Reliability and Testing (ART) 2018 colocated with International Testing Conference

arXiv:1907.01019 [pdf, other]

Understanding Fault Scenarios and Impacts through Fault Injection Experiments in Cielo

Authors: Valerio Formicola, Saurabh Jha, Daniel Chen, Fei Deng, Amanda Bonnie, Mike Mason, Jim Brandt, Ann Gentile, Larry Kaplan, Jason Repik, Jeremy Enos, Mike Showerman, Annette Greiner, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Bill Krammer

Abstract: We present a set of fault injection experiments performed on the ACES (LANL/SNL) Cray XE supercomputer Cielo. We use this experimental campaign to improve the understanding of failure causes and propagation that we observed in the field failure data analysis of NCSA's Blue Waters. We use the data collected from the logs and from network performance counter data 1) to characterize the fault-error-f… ▽ More We present a set of fault injection experiments performed on the ACES (LANL/SNL) Cray XE supercomputer Cielo. We use this experimental campaign to improve the understanding of failure causes and propagation that we observed in the field failure data analysis of NCSA's Blue Waters. We use the data collected from the logs and from network performance counter data 1) to characterize the fault-error-failure sequence and recovery mechanisms in the Gemini network and in the Cray compute nodes, 2) to understand the impact of failures on the system and the user applications at different scale, and 3) to identify and recreate fault scenarios that induce unrecoverable failures, in order to create new tests for system and application design. The faults were injected through special input commands to bring down network links, directional connections, nodes, and blades. We present extensions that will be needed to apply our methodologies of injection and analysis to the Cray XC (Aries) systems. △ Less

Submitted 1 July, 2019; originally announced July 2019.

Comments: Presented at Cray User Group 2017

arXiv:1803.02657 [pdf, other]

doi 10.1109/TC.2018.2875733

ASAP: Accelerated Short-Read Alignment on Programmable Hardware

Authors: Subho S. Banerjee, Mohamed El-Hadedy, Jong Bin Lim, Zbigniew T. Kalbarczyk, Deming Chen, Steve Lumetta, Ravishankar K. Iyer

Abstract: The proliferation of high-throughput sequencing machines ensures rapid generation of up to billions of short nucleotide fragments in a short period of time. This massive amount of sequence data can quickly overwhelm today's storage and compute infrastructure. This paper explores the use of hardware acceleration to significantly improve the runtime of short-read alignment, a crucial step in preproc… ▽ More The proliferation of high-throughput sequencing machines ensures rapid generation of up to billions of short nucleotide fragments in a short period of time. This massive amount of sequence data can quickly overwhelm today's storage and compute infrastructure. This paper explores the use of hardware acceleration to significantly improve the runtime of short-read alignment, a crucial step in preprocessing sequenced genomes. We focus on the Levenshtein distance (edit-distance) computation kernel and propose the ASAP accelerator, which utilizes the intrinsic delay of circuits for edit-distance computation elements as a proxy for computation. Our design is implemented on an Xilinx Virtex 7 FPGA in an IBM POWER8 system that uses the CAPI interface for cache coherence across the CPU and FPGA. Our design is $200\times$ faster than the equivalent C implementation of the kernel running on the host processor and $2.2\times$ faster for an end-to-end alignment tool for 120-150 base-pair short-read sequences. Further the design represents a $3760\times$ improvement over the CPU in performance/Watt terms. △ Less

Submitted 23 May, 2018; v1 submitted 6 March, 2018; originally announced March 2018.

arXiv:1709.05935 [pdf, other]

Data Integrity Threats and Countermeasures in Railway Spot Transmission Systems

Authors: Hoon Wei Lim, William G. Temple, Bao Anh N. Tran, Binbin Chen, Zbigniew Kalbarczyk, Jianying Zhou

Abstract: Modern trains rely on balises (communication beacons) located on the track to provide location information as they traverse a rail network. Balises, such as those conforming to the Eurobalise standard, were not designed with security in mind and are thus vulnerable to cyber attacks targeting data availability, integrity, or authenticity. In this work, we discuss data integrity threats to balise tr… ▽ More Modern trains rely on balises (communication beacons) located on the track to provide location information as they traverse a rail network. Balises, such as those conforming to the Eurobalise standard, were not designed with security in mind and are thus vulnerable to cyber attacks targeting data availability, integrity, or authenticity. In this work, we discuss data integrity threats to balise transmission modules and use high-fidelity simulation to study the risks posed by data integrity attacks. To mitigate such risk, we propose a practical two-layer solution: at the device level, we design a lightweight and low-cost cryptographic solution to protect the integrity of the location information; at the system layer, we devise a secure hybrid train speed controller to mitigate the impact under various attacks. Our simulation results demonstrate the effectiveness of our proposed solutions. △ Less

Submitted 18 September, 2017; originally announced September 2017.

arXiv:1602.02860 [pdf, ps, other]

doi 10.1145/2508859.2516705

Impact of integrity attacks on real-time pricing in smart grids

Authors: Rui Tan, Varun Badrinath Krishna, David K. Y. Yau, Zbigniew Kalbarczyk

Abstract: Modern information and communication technologies used by smart grids are subject to cybersecurity threats. This paper studies the impact of integrity attacks on real-time pricing (RTP), a key feature of smart grids that uses such technologies to improve system efficiency. Recent studies have shown that RTP creates a closed loop formed by the mutually dependent real-time price signals and price-ta… ▽ More Modern information and communication technologies used by smart grids are subject to cybersecurity threats. This paper studies the impact of integrity attacks on real-time pricing (RTP), a key feature of smart grids that uses such technologies to improve system efficiency. Recent studies have shown that RTP creates a closed loop formed by the mutually dependent real-time price signals and price-taking demand. Such a closed loop can be exploited by an adversary whose objective is to destabilize the pricing system. Specifically, small malicious modifications to the price signals can be iteratively amplified by the closed loop, causing inefficiency and even severe failures such as blackouts. This paper adopts a control-theoretic approach to deriving the fundamental conditions of RTP stability under two broad classes of integrity attacks, namely, the scaling and delay attacks. We show that the RTP system is at risk of being destabilized only if the adversary can compromise the price signals advertised to smart meters by reducing their values in the scaling attack, or by providing old prices to over half of all consumers in the delay attack. The results provide useful guidelines for system operators to analyze the impact of various attack parameters on system stability, so that they may take adequate measures to secure RTP systems. △ Less

Submitted 8 February, 2016; originally announced February 2016.

Comments: Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security

arXiv:1507.03518 [pdf]

doi 10.1371/journal.pone.0151470

Adverse Events in Robotic Surgery: A Retrospective Study of 14 Years of FDA Data

Authors: Homa Alemzadeh, Ravishankar K. Iyer, Zbigniew Kalbarczyk, Nancy Leveson, Jaishankar Raman

Abstract: Understanding the causes and patient impacts of surgical adverse events will help improve systems and operational practices to avoid incidents in the future. We analyzed the adverse events data related to robotic systems and instruments used in minimally invasive surgery, reported to the U.S. FDA MAUDE database from January 2000 to December 2013. We determined the number of events reported per pro… ▽ More Understanding the causes and patient impacts of surgical adverse events will help improve systems and operational practices to avoid incidents in the future. We analyzed the adverse events data related to robotic systems and instruments used in minimally invasive surgery, reported to the U.S. FDA MAUDE database from January 2000 to December 2013. We determined the number of events reported per procedure and per surgical specialty, the most common types of device malfunctions and their impact on patients, and the causes for catastrophic events such as major complications, patient injuries, and deaths. During the study period, 144 deaths (1.4% of the 10,624 reports), 1,391 patient injuries (13.1%), and 8,061 device malfunctions (75.9%) were reported. The numbers of injury and death events per procedure have stayed relatively constant since 2007 (mean = 83.4, 95% CI, 74.2-92.7). Surgical specialties, for which robots are extensively used, such as gynecology and urology, had lower number of injuries, deaths, and conversions per procedure than more complex surgeries, such as cardiothoracic and head and neck (106.3 vs. 232.9, Risk Ratio = 2.2, 95% CI, 1.9-2.6). Device and instrument malfunctions, such as falling of burnt/broken pieces of instruments into the patient (14.7%), electrical arcing of instruments (10.5%), unintended operation of instruments (8.6%), system errors (5%), and video/imaging problems (2.6%), constituted a major part of the reports. Device malfunctions impacted patients in terms of injuries or procedure interruptions. In 1,104 (10.4%) of the events, the procedure was interrupted to restart the system (3.1%), to convert the procedure to non-robotic techniques (7.3%), or to reschedule it to a later time (2.5%). Adoption of advanced techniques in design and operation of robotic surgical systems may reduce these preventable incidents in the future. △ Less

Submitted 20 July, 2015; v1 submitted 13 July, 2015; originally announced July 2015.

Comments: Presented as the J. Maxwell Chamberlain Memorial Paper for adult cardiac surgery at the 50th Annual Meeting of the Society of Thoracic Surgeons in January. See Appendix for more detailed results, discussions, and related work. Updated the headers

Journal ref: PLOS ONE 11(4) (2016) e0151470

arXiv:1504.07135 [pdf]

doi 10.1007/978-3-319-24255-2_16

Systems-theoretic Safety Assessment of Robotic Telesurgical Systems

Authors: Homa Alemzadeh, Daniel Chen, Andrew Lewis, Zbigniew Kalbarczyk, Jaishankar Raman, Nancy Leveson, Ravishankar K. Iyer

Abstract: Robotic telesurgical systems are one of the most complex medical cyber-physical systems on the market, and have been used in over 1.75 million procedures during the last decade. Despite significant improvements in design of robotic surgical systems through the years, there have been ongoing occurrences of safety incidents during procedures that negatively impact patients. This paper presents an ap… ▽ More Robotic telesurgical systems are one of the most complex medical cyber-physical systems on the market, and have been used in over 1.75 million procedures during the last decade. Despite significant improvements in design of robotic surgical systems through the years, there have been ongoing occurrences of safety incidents during procedures that negatively impact patients. This paper presents an approach for systems-theoretic safety assessment of robotic telesurgical systems using software-implemented fault-injection. We used a systemstheoretic hazard analysis technique (STPA) to identify the potential safety hazard scenarios and their contributing causes in RAVEN II robot, an open-source robotic surgical platform. We integrated the robot control software with a softwareimplemented fault-injection engine which measures the resilience of the system to the identified safety hazard scenarios by automatically inserting faults into different parts of the robot control software. Representative hazard scenarios from real robotic surgery incidents reported to the U.S. Food and Drug Administration (FDA) MAUDE database were used to demonstrate the feasibility of the proposed approach for safety-based design of robotic telesurgical systems. △ Less

Submitted 8 July, 2015; v1 submitted 27 April, 2015; originally announced April 2015.

Comments: Revise based on reviewers feedback. To appear in the the International Conference on Computer Safety, Reliability, and Security (SAFECOMP) 2015

arXiv:1405.7475 [pdf, other]

Automatic Generation of Security Argument Graphs

Authors: Nils Ole Tippenhauer, William G. Temple, An Hoa Vu, Binbin Chen, David M. Nicol, Zbigniew Kalbarczyk, William H. Sanders

Abstract: Graph-based assessment formalisms have proven to be useful in the safety, dependability, and security communities to help stakeholders manage risk and maintain appropriate documentation throughout the system lifecycle. In this paper, we propose a set of methods to automatically construct security argument graphs, a graphical formalism that integrates various security-related information to argue a… ▽ More Graph-based assessment formalisms have proven to be useful in the safety, dependability, and security communities to help stakeholders manage risk and maintain appropriate documentation throughout the system lifecycle. In this paper, we propose a set of methods to automatically construct security argument graphs, a graphical formalism that integrates various security-related information to argue about the security level of a system. Our approach is to generate the graph in a progressive manner by exploiting logical relationships among pieces of diverse input information. Using those emergent argument patterns as a starting point, we define a set of extension templates that can be applied iteratively to grow a security argument graph. Using a scenario from the electric power sector, we demonstrate the graph generation process and highlight its application for system security evaluation in our prototype software tool, CyberSAGE. △ Less

Submitted 29 May, 2014; originally announced May 2014.

Comments: 10 pages, 8 figures, 1 table and 2 algorithms

arXiv:0704.0879 [pdf]

A Hierarchical Approach for Dependability Analysis of a Commercial Cache-Based RAID Storage Architecture

Authors: Mohamed Kaaniche, Luigi Romano, Zbigniew Kalbarczyk, Ravishankar Iyer, Rick Karcich

Abstract: We present a hierarchical simulation approach for the dependability analysis and evaluation of a highly available commercial cache-based RAID storage system. The archi-tecture is complex and includes several layers of overlap-ping error detection and recovery mechanisms. Three ab-straction levels have been developed to model the cache architecture, cache operations, and error detection and recov… ▽ More We present a hierarchical simulation approach for the dependability analysis and evaluation of a highly available commercial cache-based RAID storage system. The archi-tecture is complex and includes several layers of overlap-ping error detection and recovery mechanisms. Three ab-straction levels have been developed to model the cache architecture, cache operations, and error detection and recovery mechanism. The impact of faults and errors oc-curring in the cache and in the disks is analyzed at each level of the hierarchy. A simulation submodel is associated with each abstraction level. The models have been devel-oped using DEPEND, a simulation-based environment for system-level dependability analysis, which provides facili-ties to inject faults into a functional behavior model, to simulate error detection and recovery mechanisms, and to evaluate quantitative measures. Several fault models are defined for each submodel to simulate cache component failures, disk failures, transmission errors, and data errors in the cache memory and in the disks. Some of the parame-ters characterizing fault injection in a given submodel cor-respond to probabilities evaluated from the simulation of the lower-level submodel. Based on the proposed method-ology, we evaluate and analyze 1) the system behavior un-der a real workload and high error rate (focusing on error bursts), 2) the coverage of the error detection mechanisms implemented in the system and the error latency distribu-tions, and 3) the accumulation of errors in the cache and in the disks. △ Less

Submitted 6 April, 2007; originally announced April 2007.

Journal ref: Proc. 28th IEEE International Symposium on Fault-Tolerant Computing (FTCS-28), Munich (Germany), IEEE Computer Society, June 1998, pp.6-15 (1998) 6-15

Showing 1–25 of 25 results for author: Kalbarczyk, Z