  • StackVigil
  • Hyderabad

chitender/README.md
~/career/ — chitender@platform-eng — zsh — 132×40                          ⎈ ● k8s-prod-2026

Chitender_Kumar_

Senior Cloud Architect · Platform & SRE Engineering #kubernetes #sre #ebpf #finops #ai-ops #observability


email    : chitenderkumar.16@gmail.com
phone    : +91 8130 089 176
linkedin : linkedin.com/in/chitenderkumar
location : Delhi, India · open to remote
cert     : CKA / CNCF
seeking  : Staff / Principal / SRE Manager

$ summary --short

Senior Cloud Architect and SRE leader with 10+ years delivering production-grade Kubernetes platforms on AWS EKS and Azure AKS at scale. Specializes in reliability engineering (SLO/SLI/error budgets), cost-optimized compute, and automation-first operations.

Track record of shipping internal platforms — including an AI-assisted incident response system (Aria/InfraInsights), eBPF-based CNI migration (Cilium), and Karpenter-driven FinOps programs — that measurably reduce MTTR, alert noise, and operational toil.

| Metric | Value | Context |
|---|---|---|
| 🕐 Years in production | 10+ | Platform & SRE |
| ⚡ RCA surface time | < 2 min | Aria AI assistant |
| 📉 Alert noise reduction | −30% | Over 18 months |
| 🧠 Incident pairs trained | 45K | RAG corpus |

$ kubectl get experience -o wide

▸ Senior Cloud Architect @ Innovaccer · Jul 2024 — Present

Lead platform engineering & SRE for a healthcare data activation platform running thousands of Kubernetes workloads across EKS and AKS.

| Domain | Project | Impact |
|---|---|---|
| [networking] | Cilium CNI migration on EKS/AKS: led end-to-end migration from legacy CNI to Cilium (eBPF). Enforced L4/L7 policy, improved network observability, reduced connection-level latency | Zero production outages during cutover |
| [finops] | Karpenter program at scale: Spot/On-Demand mix, instance diversity, disruption windows. Right-sized node pools via load testing | Cut compute cost while holding P99 within SLO during peak |
| [ai-ops] | Aria (AI-assisted incident response): LLM-powered RCA assistant (Gemma 4 + Qdrant + Ollama on AKS), trained on ~45K historical incident Q&A pairs | Surfaces root causes in < 2 min inside on-call workflows |
| [governance] | Multi-account lifecycle tooling: Python + Kubernetes API tooling for cross-account operations using AWS STS | Reduced manual provisioning overhead; enforced security guardrails at scale |

▸ Lead SRE → SRE Manager → Staff SRE @ Innovaccer · Feb 2021 — Sep 2024

Built and led the SRE function from IC to team lead. Owned availability, incident command, and measurable reliability improvements across critical services.

| Domain | Project | Impact |
|---|---|---|
| [observability] | InfraInsights platform: full-stack observability + AI-RCA platform (Prometheus / Grafana / OTel / CloudWatch) with SLO dashboards, burn-rate alerts, and auto-linked playbooks | Replaced fragmented monitoring with a single pane of glass |
| [controllers] | KubeNightwatch operator: production K8s controller enforcing time-window and change-freeze policies; Slack + Jira integration; deletes out-of-policy resources | Eliminated an entire class of unauthorized change incidents |
| [data] | Kafka & OpenSearch reliability: resolved ISR instability, partition skew, and GC-related consumer lag via tuning, scaling patterns, and codified runbooks | Significant drop in repeat Kafka incident volume |
| [chaos] | Chaos engineering practice: Spot interruption drills and failover chaos tests | Exposed hidden single points of failure before they became incidents |
| [platform] | Postmortem culture: drove postmortems to time-bound, code-level fixes | Reduced noisy pages by more than 30% over 18 months; materially improved MTTR |

▸ Earlier Roles

| Company | Role | Period |
|---|---|---|
| Atlan | SDE II, Platform Engineering | Feb 2020 – Jan 2021 |
| Delhivery | Senior DevOps Engineer | Jun 2019 – Feb 2020 |
| Innovaccer | Infrastructure Engineer | Aug 2017 – Jul 2019 |
| Telenity | Solution Architect / Engineer Operations | Feb 2015 – Aug 2017 |

$ ls ./signature-projects/

🤖 Aria · AI INCIDENT ASSISTANT

LLM-powered RCA system (Gemma 4 + Qdrant + Ollama on AKS), trained on ~45K Q&A pairs from historical incidents.

Surfaces root-cause hypotheses in < 2 min, integrated into on-call tooling.

Gemma4 Qdrant Ollama AKS RAG
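
The retrieval step of a RAG-style assistant like this can be sketched roughly as below. This is a minimal illustration only: it uses plain-Python cosine similarity as a stand-in for a vector database, and the incident ids, embeddings, and function names are hypothetical, not taken from Aria.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_incidents(query_vec, corpus, k=3):
    """Return the ids of the k corpus entries most similar to query_vec.

    corpus is a list of (incident_id, embedding) pairs.
    """
    scored = [(cosine_similarity(query_vec, emb), incident_id)
              for incident_id, emb in corpus]
    scored.sort(reverse=True)  # highest similarity first
    return [incident_id for _, incident_id in scored[:k]]


# Hypothetical 3-dimensional embeddings; real embeddings are far larger.
corpus = [
    ("INC-1 kafka consumer lag", [0.9, 0.1, 0.0]),
    ("INC-2 node disk pressure", [0.0, 0.2, 0.9]),
    ("INC-3 kafka ISR shrink", [0.8, 0.3, 0.1]),
]
nearest = top_k_incidents([1.0, 0.2, 0.0], corpus, k=2)
```

In a setup like the one described, the retrieved incidents would then be passed to the LLM as context for a root-cause hypothesis.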

📊 InfraInsights · OBSERVABILITY PLATFORM

Prometheus / Grafana / OTel / CloudWatch unified observability with SLO dashboards, burn-rate alerts, and auto-linked runbooks.

Replaced fragmented monitoring across multi-tenant EKS/AKS.

Prometheus Grafana OTel CloudWatch SLO
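
For readers unfamiliar with burn-rate alerts, here is a minimal sketch of the arithmetic, assuming the common multi-window convention in which a page fires only when both a long and a short window burn the error budget faster than a threshold. The function names and the 14.4 default are illustrative assumptions, not InfraInsights configuration.

```python
def burn_rate(error_ratio, slo_target):
    """How fast the error budget is being consumed.

    A burn rate of 1.0 spends the budget exactly over the SLO period;
    higher values exhaust it proportionally sooner.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target, threshold=14.4):
    """Page only if both windows exceed the burn-rate threshold.

    The short window keeps the alert from firing long after the
    problem has already stopped.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


# Example: 99.9% SLO, 2% errors over the long window, 3% over the short one.
page_now = should_page(0.02, 0.03, slo_target=0.999)
```

A threshold of 14.4 sustained for one hour corresponds to spending 2% of a 30-day budget (14.4 × 1 h / 720 h = 0.02).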

🔒 KubeNightwatch · CHANGE-FREEZE OPERATOR

Kubernetes operator enforcing time-window and change-freeze policies; deletes out-of-policy resources; Slack/Jira integration.

Eliminated unauthorized change incidents entirely.

Go Kubernetes Operator Slack Jira
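
The core decision of a change-freeze policy can be sketched as a time-window check. This is a hypothetical illustration in Python (the operator itself is listed as Go); the window format and function names are assumptions, not KubeNightwatch's actual API.

```python
from datetime import datetime, time


def in_freeze_window(now, windows):
    """Return True if `now` falls inside any freeze window.

    windows is a list of (weekday, start, end) tuples, where weekday
    follows datetime.weekday() (0 = Monday) and start/end are
    datetime.time values; the end is exclusive.
    """
    for weekday, start, end in windows:
        if now.weekday() == weekday and start <= now.time() < end:
            return True
    return False


def decide(now, windows):
    """Map the window check to a policy decision for a proposed change."""
    return "deny" if in_freeze_window(now, windows) else "allow"


# Hypothetical policy: no changes on Friday evenings.
friday_evening_freeze = [(4, time(18, 0), time(23, 59))]
decision = decide(datetime(2026, 1, 2, 19, 0), friday_evening_freeze)
```

A real controller would apply this kind of check when it observes a resource change and then act on the result, for example by notifying a channel or removing the out-of-policy resource.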

💰 Karpenter at Scale · FINOPS AUTOSCALING

Workload-aware Spot/On-Demand autoscaling with instance diversity, disruption windows, and forecast-based pre-warm for batch workloads.

Cut compute cost while sustaining P99 SLOs.

Karpenter Spot EKS AKS FinOps


$ env | grep STACK

KUBERNETES_CLOUD  = "EKS, AKS, GKE, Karpenter, Cilium (eBPF), Helm, Argo CD, Flux, Operators, KEDA"
CLOUD_PLATFORMS   = "AWS (EKS, STS, EC2, RDS, S3, CloudWatch, IAM), Azure (AKS, Monitor), GCP"
OBSERVABILITY     = "Prometheus, Grafana, OpenTelemetry, Alertmanager, CloudWatch, Jaeger, Loki"
IAC_AUTOMATION    = "Terraform, Ansible, Python, Go, Bash, Jenkins, GitLab CI, GitHub Actions"
DATA_MESSAGING    = "Kafka, OpenSearch / Elasticsearch, Redis, PostgreSQL, Qdrant (vector DB)"
AI_ML_OPS         = "LLM integration (Gemma 4, Ollama), RAG pipelines, vector embedding, AI-assisted RCA"
RELIABILITY       = "SLO/SLI/Error Budgets, Incident Command, Chaos Engineering, DORA Metrics, Postmortems"

$ cat ./credentials.txt

✓  Certified Kubernetes Administrator (CKA) — CNCF
✓  B.Tech, Computer Science — LDIET, 2016

$ git log --oneline --graph   # what I'm working on

* feat(ai-ops):     Aria v2 — multi-cluster incident correlation
* feat(finops):     Karpenter forecast-based pre-warm for batch windows
* feat(ebpf):       Cilium Hubble flows → OTel pipeline
* chore(platform):  Multi-tenant control/data plane IaC standardization

K8S · EBPF · OTEL · KARPENTER · KAFKA · TERRAFORM · GO · PYTHON · LLM-OPS

chitender@platform:~ $ exit 0   # thanks for reading — ck / 2026


Popular repositories

  1. Prometheus-setup Public

    HCL 2

  2. katacoda-scenarios Public

    Katacoda Scenarios

    1

  3. kube-SmartScheduler Public

    SmartScheduler: An intelligent Kubernetes operator that dynamically assigns pods to nodes based on custom weights, base counts, and label-based placement strategies for optimal workload distribution.

    Go 1

  4. crmportal Public

    FIRST PROJECT

    Java

  5. proect1 Public

  6. git Public

    crm

    Java