Principal Elastic Platform Owner & Onsite Leader

Job Type: Full Time
Locations: Atlanta

Employment Type: Full-time
Experience: 10+ Years

Required Skills :

Technical Skills

  • Elastic Expert : 5+ years of production experience with the Elastic Stack (Elasticsearch, Kibana, Logstash, Beats).
  • Kubernetes Mastery : 3+ years managing Elastic Cloud on Kubernetes (ECK) or similar operators on enterprise K8s distributions (Anthos, GKE, or EKS).
  • Ingest & Pipelines: Deep knowledge of log ingest architectures, index templates, sharding strategies, and cluster tuning.

Leadership & Experience

  • Onsite Leadership : Proven ability to run day-to-day operations, manage technical rosters, and lead cross-functional troubleshooting sessions.
  • Enterprise Scale : Experience supporting large-scale platforms and workloads (e.g., ETL jobs, microservices, anomaly detection) in a complex corporate environment.
  • Incident Management : Familiarity with SRE practices, including incident response (P1-P3), MTTR tracking, and root cause analysis.

Preferred Skills

  • Experience with Infrastructure-as-Code (Terraform, Helm) and CI/CD pipelines (Jenkins, GitLab).
  • Knowledge of strategies for migrating from legacy logging platforms (e.g., Splunk) to Elastic.

Responsibilities:

1. Platform Architecture & Onsite Leadership

  • Onsite Operational Lead : Act as the primary point of contact for platform stability. Manage daily stand-ups, prioritize the engineering backlog, and coordinate between offshore and onsite teams.
  • Architecture & Reliability : Own the architecture and reliability of the Elastic Stack (Elasticsearch, Kibana, ingest components) deployed on Elastic Cloud on Kubernetes (ECK).
  • Capacity Planning : Design cluster topology (nodes, roles, zones) and manage resource quotas (CPU, heap, disk) to balance cost efficiency and performance.
  • SLO Management : Define and track Service Level Objectives (SLOs) for ingestion latency, search availability, and data retention.

2. Logging Strategy & Data Modeling

  • Standardization : Define enterprise logging standards and index templates, including field conventions (service, environment, tenant, correlation IDs), to ensure reliable event correlation across the observability stack.
  • Schema Design : Work with application teams to implement standardized mappings and Index Lifecycle Management (ILM) policies (Hot/Warm/Cold tiers, rollover, and retention).
  • Data Quality : Own ingest patterns using Filebeat, Fluent Bit, Logstash, or Elastic Agent. Design parsing pipelines (JSON, Grok), enrichment logic, and dead-letter queue strategies.

3. Kubernetes & Infrastructure Operations

  • Cluster Management : Oversee daily health across Kubernetes-based clusters (e.g., Anthos/GKE). Resolve pod-level issues such as pods in CrashLoopBackOff, memory spikes, and disk usage alerts.
  • Storage Operations : Lead the management and migration of Persistent Volume Claims (PVCs) for StatefulSets, ensuring high availability during infrastructure upgrades.
  • Security & Governance : Implement RBAC for indices and Kibana spaces. Enforce data governance and retention policies based on data classification (infrastructure logs vs. application logs vs. sensitive data).

4. Observability & Enablement

  • Self-Service Enablement : Deliver Kibana dashboards and “Golden Queries” to support SRE and NOC teams in rapid incident triage.
  • Documentation & Runbooks : Author and maintain operational runbooks, disaster recovery scenarios, and “Self-Help” guides for onboarding new log sources.
  • Mentorship : Provide technical guidance and training sessions for platform engineers and application developers on effective search and logging practices.

Apply for this position

To apply, email your resume to careers@softility.com.