Showing 1–25 of 25 results for author: Kalbarczyk, Z

Searching in archive cs. Search in all archives.
  1. arXiv:2404.11169

    cs.DC

    Mutiny! How does Kubernetes fail, and what can we do about it?

    Authors: Marco Barletta, Marcello Cinque, Catello Di Martino, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer

    Abstract: In this paper, we i) analyze and classify real-world failures of Kubernetes (the most popular container orchestration system), ii) develop a framework to perform a fault/error injection campaign targeting the data store preserving the cluster state, and iii) compare results of our fault/error injection experiments with real-world failures, showing that our fault/error injections can recreate many…

    Submitted 17 April, 2024; originally announced April 2024.

  2. arXiv:2404.08509

    cs.DC cs.CL cs.LG

    Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

    Authors: Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Başar, Ravishankar K. Iyer

    Abstract: Large language models (LLMs) have been driving a new wave of interactive AI applications across numerous domains. However, efficiently serving LLM inference requests is challenging due to their unpredictable execution times originating from the autoregressive nature of generative models. Existing LLM serving systems exploit first-come-first-serve (FCFS) scheduling, suffering from head-of-line bloc…

    Submitted 12 April, 2024; originally announced April 2024.

    Comments: Accepted at AIOps'24

  3. arXiv:2404.00869

    cs.CR

    Towards Automated Generation of Smart Grid Cyber Range for Cybersecurity Experiments and Training

    Authors: Daisuke Mashima, Muhammad M. Roomi, Bennet Ng, Zbigniew Kalbarczyk, S. M. Suhail Hussain, Ee-chien Chang

    Abstract: Assurance of cybersecurity is crucial to ensure dependability and resilience of smart power grid systems. In order to evaluate the impact of potential cyber attacks, to assess deployability and effectiveness of cybersecurity measures, and to enable hands-on exercise and training of personnel, an interactive, virtual environment that emulates the behaviour of a smart grid system, namely smart grid…

    Submitted 31 March, 2024; originally announced April 2024.

    Comments: Published at DSN 2023 Industry Track

  4. arXiv:2403.07890

    cs.GT cs.AI cs.LG

    $\widetilde{O}(T^{-1})$ Convergence to (Coarse) Correlated Equilibria in Full-Information General-Sum Markov Games

    Authors: Weichao Mao, Haoran Qiu, Chen Wang, Hubertus Franke, Zbigniew Kalbarczyk, Tamer Başar

    Abstract: No-regret learning has a long history of being closely connected to game theory. Recent works have devised uncoupled no-regret learning dynamics that, when adopted by all the players in normal-form games, converge to various equilibrium solutions at a near-optimal rate of $\widetilde{O}(T^{-1})$, a significant improvement over the $O(1/\sqrt{T})$ rate of classic no-regret learners. However, analog…

    Submitted 23 April, 2024; v1 submitted 2 February, 2024; originally announced March 2024.

  5. arXiv:2403.05448

    cs.CR

    On Practicality of Using ARM TrustZone Trusted Execution Environment for Securing Programmable Logic Controllers

    Authors: Zhiang Li, Daisuke Mashima, Wen Shei Ong, Ertem Esiner, Zbigniew Kalbarczyk, Ee-Chien Chang

    Abstract: Programmable logic controllers (PLCs) are crucial devices for implementing automated control in various industrial control systems (ICS), such as smart power grids, water treatment systems, manufacturing, and transportation systems. Owing to their importance, PLCs are often the target of cyber attackers that are aiming at disrupting the operation of ICS, including the nation's critical infrastruct…

    Submitted 8 March, 2024; originally announced March 2024.

    Comments: To appear at ACM AsiaCCS 2024

  6. arXiv:2206.00886

    cs.RO cs.LG

    Watch Out for the Safety-Threatening Actors: Proactively Mitigating Safety Hazards

    Authors: Saurabh Jha, Shengkun Cui, Zbigniew Kalbarczyk, Ravishankar K. Iyer

    Abstract: Despite the successful demonstration of autonomous vehicles (AVs), such as self-driving cars, ensuring AV safety remains a challenging task. Although some actors influence an AV's driving decisions more than others, current approaches pay equal attention to each actor on the road. An actor's influence on the AV's decision can be characterized in terms of its ability to decrease the number of safe…

    Submitted 2 June, 2022; originally announced June 2022.

  7. arXiv:2110.09998

    cs.AI cs.RO

    Watch out for the risky actors: Assessing risk in dynamic environments for safe driving

    Authors: Saurabh Jha, Yan Miao, Zbigniew Kalbarczyk, Ravishankar K. Iyer

    Abstract: Driving in a dynamic environment that consists of other actors is inherently a risky task as each actor influences the driving decision and may significantly limit the number of choices in terms of navigation and safety plan. The risk encountered by the Ego actor depends on the driving scenario and the uncertainty associated with predicting the future trajectories of the other actors in the drivin…

    Submitted 19 October, 2021; originally announced October 2021.

    Comments: preprint version

  8. arXiv:2109.11666

    cs.OS cs.PF

    SLO beyond the Hardware Isolation Limits

    Authors: Haoran Qiu, Yongzhou Chen, Tianyin Xu, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer

    Abstract: Performance isolation is a keystone for SLO guarantees with shared resources in cloud and datacenter environments. To meet SLO requirements, the state of the art relies on hardware QoS support (e.g., Intel RDT) to allocate shared resources such as last-level caches and memory bandwidth for co-located latency-critical applications. As a result, the number of latency-critical applications that can b…

    Submitted 23 September, 2021; originally announced September 2021.

  9. arXiv:2102.10837

    cs.DC cs.AI cs.AR cs.PF

    BayesPerf: Minimizing Performance Monitoring Errors Using Bayesian Statistics

    Authors: Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer

    Abstract: Hardware performance counters (HPCs) that measure low-level architectural and microarchitectural events provide dynamic contextual information about the state of the system. However, HPC measurements are error-prone due to nondeterminism (e.g., undercounting due to event multiplexing, or OS interrupt-handling behaviors). In this paper, we present BayesPerf, a system for quantifying uncertainty in…

    Submitted 22 February, 2021; originally announced February 2021.

    Journal ref: Proceedings of the Twenty-Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 21), 2021

  10. arXiv:2012.07755

    cs.DC cs.NI

    Application-aware Congestion Mitigation for High-Performance Computing Systems

    Authors: Archit Patke, Saurabh Jha, Haoran Qiu, Jim Brandt, Ann Gentile, Joe Greenseid, Zbigniew Kalbarczyk, Ravishankar Iyer

    Abstract: High-performance computing (HPC) systems frequently experience congestion leading to significant application performance variation. However, the impact of congestion on application runtime differs from application to application depending on their network characteristics (such as bandwidth and latency requirements). We leverage this insight to develop Netscope, an automated ML-driven framework tha…

    Submitted 3 February, 2021; v1 submitted 14 December, 2020; originally announced December 2020.

  11. arXiv:2008.08509

    cs.DC cs.PF

    FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices

    Authors: Haoran Qiu, Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer

    Abstract: Modern user-facing latency-sensitive web services include numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing of compute resources across microservices is still challenging in production because contention for shared resources can cause latency spikes that violate the service-level objectives (SLOs) of user reque…

    Submitted 19 October, 2020; v1 submitted 19 August, 2020; originally announced August 2020.

    Comments: This paper was accepted in OSDI '20

  12. arXiv:2004.13004

    cs.CR cs.CV cs.LG cs.RO

    ML-driven Malware that Targets AV Safety

    Authors: Saurabh Jha, Shengkun Cui, Subho S. Banerjee, Timothy Tsai, Zbigniew Kalbarczyk, Ravi Iyer

    Abstract: Ensuring the safety of autonomous vehicles (AVs) is critical for their mass deployment and public adoption. However, security attacks that violate safety constraints and cause accidents are a significant deterrent to achieving public trust in AVs, and that hinders a vendor's ability to deploy AVs. Creating a security hazard that results in a severe safety compromise (for example, an accident) is c…

    Submitted 12 June, 2020; v1 submitted 24 April, 2020; originally announced April 2020.

    Comments: Accepted for DSN 2020

    Journal ref: 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks

  13. arXiv:1909.02119

    cs.DC cs.LG

    Inductive-bias-driven Reinforcement Learning For Efficient Schedules in Heterogeneous Clusters

    Authors: Subho S Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer

    Abstract: The problem of scheduling of workloads onto heterogeneous processors (e.g., CPUs, GPUs, FPGAs) is of fundamental importance in modern data centers. Current system schedulers rely on application/system-specific heuristics that have to be built on a case-by-case basis. Recent work has demonstrated ML techniques for automating the heuristic search by using black-box approaches which require significa…

    Submitted 30 June, 2020; v1 submitted 4 September, 2019; originally announced September 2019.

    Comments: Scheduling, Bayesian, POMDP, Sampling, Deep Reinforcement Learning, Accelerators, FPGA, GPU

    Journal ref: Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020

  14. arXiv:1907.10203

    cs.DC cs.LG

    Live Forensics for Distributed Storage Systems

    Authors: Saurabh Jha, Shengkun Cui, Tianyin Xu, Jeremy Enos, Mike Showerman, Mark Dalton, Zbigniew T. Kalbarczyk, William T. Kramer, Ravishankar K. Iyer

    Abstract: We present Kaleidoscope, an innovative system that supports live forensics for application performance problems caused by either individual component failures or resource contention issues in large-scale distributed storage systems. The design of Kaleidoscope is driven by our study of I/O failures observed in a peta-scale storage system anonymized as PetaStore. Kaleidoscope is built on three key fe…

    Submitted 23 July, 2019; originally announced July 2019.

  15. arXiv:1907.05312

    cs.DC cs.NI

    A Study of Network Congestion in Two Supercomputing High-Speed Interconnects

    Authors: Saurabh Jha, Archit Patke, Jim Brandt, Ann Gentile, Mike Showerman, Eric Roman, Zbigniew T. Kalbarczyk, William T. Kramer, Ravishankar K. Iyer

    Abstract: Network congestion in high-speed interconnects is a major source of application run time performance variation. Recent years have witnessed a surge of interest from both academia and industry in the development of novel approaches for congestion control at the network level and in application placement, mapping, and scheduling at the system-level. However, these studies are based on proxy applicat…

    Submitted 11 July, 2019; originally announced July 2019.

    Comments: Accepted for HOTI2019

  16. arXiv:1907.01051

    cs.LG cs.SE stat.ML

    ML-based Fault Injection for Autonomous Vehicles: A Case for Bayesian Fault Injection

    Authors: Saurabh Jha, Subho S. Banerjee, Timothy Tsai, Siva K. S. Hari, Michael B. Sullivan, Zbigniew T. Kalbarczyk, Stephen W. Keckler, Ravishankar K. Iyer

    Abstract: The safety and resilience of fully autonomous vehicles (AVs) are of significant concern, as exemplified by several headline-making accidents. While AV development today involves verification, validation, and testing, end-to-end assessment of AV systems under accidental faults in realistic driving scenarios has been largely unexplored. This paper presents DriveFI, a machine learning-based fault inj…

    Submitted 1 July, 2019; originally announced July 2019.

    Comments: Accepted at 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks

  17. arXiv:1907.01024

    cs.SE

    Kayotee: A Fault Injection-based System to Assess the Safety and Reliability of Autonomous Vehicles to Faults and Errors

    Authors: Saurabh Jha, Timothy Tsai, Siva Hari, Michael Sullivan, Zbigniew Kalbarczyk, Stephen W. Keckler, Ravishankar K. Iyer

    Abstract: Fully autonomous vehicles (AVs), i.e., AVs with autonomy level 5, are expected to dominate road transportation in the near future and contribute trillions of dollars to the global economy. The general public, government organizations, and manufacturers all have significant concern regarding resiliency and safety standards of the autonomous driving system (ADS) of AVs. In this work, we proposed an…

    Submitted 1 July, 2019; originally announced July 2019.

    Comments: Presented at Automotive Reliability and Testing (ART) 2018 colocated with International Testing Conference

  18. arXiv:1907.01019

    cs.DC

    Understanding Fault Scenarios and Impacts through Fault Injection Experiments in Cielo

    Authors: Valerio Formicola, Saurabh Jha, Daniel Chen, Fei Deng, Amanda Bonnie, Mike Mason, Jim Brandt, Ann Gentile, Larry Kaplan, Jason Repik, Jeremy Enos, Mike Showerman, Annette Greiner, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Bill Krammer

    Abstract: We present a set of fault injection experiments performed on the ACES (LANL/SNL) Cray XE supercomputer Cielo. We use this experimental campaign to improve the understanding of failure causes and propagation that we observed in the field failure data analysis of NCSA's Blue Waters. We use the data collected from the logs and from network performance counter data 1) to characterize the fault-error-f…

    Submitted 1 July, 2019; originally announced July 2019.

    Comments: Presented at Cray User Group 2017

  19. ASAP: Accelerated Short-Read Alignment on Programmable Hardware

    Authors: Subho S. Banerjee, Mohamed El-Hadedy, Jong Bin Lim, Zbigniew T. Kalbarczyk, Deming Chen, Steve Lumetta, Ravishankar K. Iyer

    Abstract: The proliferation of high-throughput sequencing machines ensures rapid generation of up to billions of short nucleotide fragments in a short period of time. This massive amount of sequence data can quickly overwhelm today's storage and compute infrastructure. This paper explores the use of hardware acceleration to significantly improve the runtime of short-read alignment, a crucial step in preproc…

    Submitted 23 May, 2018; v1 submitted 6 March, 2018; originally announced March 2018.

  20. arXiv:1709.05935

    cs.CR

    Data Integrity Threats and Countermeasures in Railway Spot Transmission Systems

    Authors: Hoon Wei Lim, William G. Temple, Bao Anh N. Tran, Binbin Chen, Zbigniew Kalbarczyk, Jianying Zhou

    Abstract: Modern trains rely on balises (communication beacons) located on the track to provide location information as they traverse a rail network. Balises, such as those conforming to the Eurobalise standard, were not designed with security in mind and are thus vulnerable to cyber attacks targeting data availability, integrity, or authenticity. In this work, we discuss data integrity threats to balise tr…

    Submitted 18 September, 2017; originally announced September 2017.

  21. Impact of integrity attacks on real-time pricing in smart grids

    Authors: Rui Tan, Varun Badrinath Krishna, David K. Y. Yau, Zbigniew Kalbarczyk

    Abstract: Modern information and communication technologies used by smart grids are subject to cybersecurity threats. This paper studies the impact of integrity attacks on real-time pricing (RTP), a key feature of smart grids that uses such technologies to improve system efficiency. Recent studies have shown that RTP creates a closed loop formed by the mutually dependent real-time price signals and price-ta…

    Submitted 8 February, 2016; originally announced February 2016.

    Comments: Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security

  22. Adverse Events in Robotic Surgery: A Retrospective Study of 14 Years of FDA Data

    Authors: Homa Alemzadeh, Ravishankar K. Iyer, Zbigniew Kalbarczyk, Nancy Leveson, Jaishankar Raman

    Abstract: Understanding the causes and patient impacts of surgical adverse events will help improve systems and operational practices to avoid incidents in the future. We analyzed the adverse events data related to robotic systems and instruments used in minimally invasive surgery, reported to the U.S. FDA MAUDE database from January 2000 to December 2013. We determined the number of events reported per pro…

    Submitted 20 July, 2015; v1 submitted 13 July, 2015; originally announced July 2015.

    Comments: Presented as the J. Maxwell Chamberlain Memorial Paper for adult cardiac surgery at the 50th Annual Meeting of the Society of Thoracic Surgeons in January. See Appendix for more detailed results, discussions, and related work. Updated the headers

    Journal ref: PLOS ONE 11(4) (2016) e0151470

  23. Systems-theoretic Safety Assessment of Robotic Telesurgical Systems

    Authors: Homa Alemzadeh, Daniel Chen, Andrew Lewis, Zbigniew Kalbarczyk, Jaishankar Raman, Nancy Leveson, Ravishankar K. Iyer

    Abstract: Robotic telesurgical systems are one of the most complex medical cyber-physical systems on the market, and have been used in over 1.75 million procedures during the last decade. Despite significant improvements in design of robotic surgical systems through the years, there have been ongoing occurrences of safety incidents during procedures that negatively impact patients. This paper presents an ap…

    Submitted 8 July, 2015; v1 submitted 27 April, 2015; originally announced April 2015.

    Comments: Revised based on reviewers' feedback. To appear in the International Conference on Computer Safety, Reliability, and Security (SAFECOMP) 2015

  24. arXiv:1405.7475

    cs.CR

    Automatic Generation of Security Argument Graphs

    Authors: Nils Ole Tippenhauer, William G. Temple, An Hoa Vu, Binbin Chen, David M. Nicol, Zbigniew Kalbarczyk, William H. Sanders

    Abstract: Graph-based assessment formalisms have proven to be useful in the safety, dependability, and security communities to help stakeholders manage risk and maintain appropriate documentation throughout the system lifecycle. In this paper, we propose a set of methods to automatically construct security argument graphs, a graphical formalism that integrates various security-related information to argue a…

    Submitted 29 May, 2014; originally announced May 2014.

    Comments: 10 pages, 8 figures, 1 table and 2 algorithms

  25. arXiv:0704.0879

    cs.PF

    A Hierarchical Approach for Dependability Analysis of a Commercial Cache-Based RAID Storage Architecture

    Authors: Mohamed Kaaniche, Luigi Romano, Zbigniew Kalbarczyk, Ravishankar Iyer, Rick Karcich

    Abstract: We present a hierarchical simulation approach for the dependability analysis and evaluation of a highly available commercial cache-based RAID storage system. The architecture is complex and includes several layers of overlapping error detection and recovery mechanisms. Three abstraction levels have been developed to model the cache architecture, cache operations, and error detection and recov…

    Submitted 6 April, 2007; originally announced April 2007.

    Journal ref: Proc. 28th IEEE International Symposium on Fault-Tolerant Computing (FTCS-28), Munich (Germany), IEEE Computer Society, June 1998, pp. 6-15