Job Type: Full-time

Locations: Atlanta

Employment Type: Full-time

Experience: 10+ Years

Required Skills:

Technical Skills

Elastic Expert: 5+ years of production experience with the Elastic Stack (Elasticsearch, Kibana, Logstash, Beats).
Kubernetes Mastery: 3+ years managing Elastic Cloud on Kubernetes (ECK) or similar operators on enterprise K8s distributions (Anthos, GKE, or EKS).
Ingest & Pipelines: Deep knowledge of log ingest architectures, index templates, sharding strategies, and cluster tuning.

Leadership & Experience

Onsite Leadership: Proven ability to run day-to-day operations, manage technical rosters, and lead cross-functional troubleshooting sessions.
Enterprise Scale: Experience supporting large-scale platforms (e.g., ETL jobs, microservices, anomaly detection) in a complex corporate environment.
Incident Management: Familiarity with SRE practices, including incident response (P1-P3), MTTR tracking, and root cause analysis.

Preferred Skills

Experience with Infrastructure-as-Code (Terraform, Helm) and CI/CD pipelines (Jenkins, GitLab).
Knowledge of migration strategies between legacy logging platforms (e.g., Splunk) and Elastic.

Responsibilities:

1. Platform Architecture & Onsite Leadership

Onsite Operational Lead: Act as the primary point of contact for platform stability. Manage daily stand-ups, prioritize the engineering backlog, and coordinate between offshore and onsite teams.
Architecture & Reliability: Own the Elastic Stack (Elasticsearch, Kibana, ingest components) deployed on Elastic Cloud on Kubernetes (ECK).
Capacity Planning: Design cluster topology (nodes, roles, zones) and manage resource quotas (CPU, heap, disk) to ensure cost-efficiency and performance.
SLO Management: Define and track Service Level Objectives (SLOs) for ingestion latency, search availability, and data retention.

2. Logging Strategy & Data Modeling

Standardization: Define enterprise logging and index templates, including field conventions (service, environment, tenant, correlation IDs) to ensure reliable event correlation across the observability stack.
Schema Design: Work with application teams to implement standardized mappings and Index Lifecycle Management (ILM) policies (Hot/Warm/Cold tiers, rollover, and retention).
Data Quality: Own ingest patterns using Filebeat, Fluent Bit, Logstash, or Elastic Agent. Design parsing pipelines (JSON, Grok), enrichment logic, and dead-letter queue strategies.

3. Kubernetes & Infrastructure Operations

Cluster Management: Oversee daily health across Kubernetes-based clusters (e.g., Anthos/GKE). Resolve pod-level issues such as CrashLoopBackOff states, memory spikes, and disk usage alerts.
Storage Operations: Lead the management and migration of Persistent Volume Claims (PVCs) for StatefulSets, ensuring high availability during infrastructure upgrades.
Security & Governance: Implement RBAC for indices and Kibana spaces. Enforce data governance and retention policies based on data classification (infrastructure logs vs. application logs vs. sensitive data).

4. Observability & Enablement

Self-Service Enablement: Deliver Kibana dashboards and “Golden Queries” to support SRE and NOC teams in rapid incident triage.
Documentation & Runbooks: Author and maintain operational runbooks, disaster recovery scenarios, and “Self-Help” guides for onboarding new log sources.
Mentorship: Provide technical guidance and training sessions for platform engineers and application developers on effective search and logging practices.

To apply, email your resume to careers@softility.com