ECS is seeking an Elastic SRE Engineer to work in our Fairfax, VA office.
ECS is seeking talented professionals to join our successful and growing team in building the next-generation Continuous Diagnostics and Mitigation (CDM) Cyber data solution. The CDM Program is the Cybersecurity and Infrastructure Security Agency’s (CISA) dynamic approach to strengthening the cybersecurity of Federal networks and systems through better awareness and visibility into their security posture and cyber threats. ECS is responsible for designing, building, deploying, operating, and maintaining a complete ‘Data Services’ solution, which includes the collection, normalization, visualization, and sharing of cyber data from more than 100 Federal agencies. The CDM Data Services product is an integrated suite of multiple Commercial Off-the-Shelf (COTS) products, software configuration packages, and custom code that work together as a solution tailored to meet Department of Homeland Security (DHS) requirements.
We are seeking professionals who thrive in a dynamic, fast-paced, and highly collaborative environment where problem-solving, critical thinking, and a holistic approach to serving the mission are key. Our program operates within the Scaled Agile Framework (SAFe). Aptitude and enthusiasm for continuous learning, improvement, and cybersecurity are a must!
ECS is currently seeking a skilled Elastic Site Reliability Engineer (SRE) to support the Department of Homeland Security (DHS) Continuous Diagnostics and Mitigation (CDM) SIEM as a Service (SIEMaaS) Project. The CDM SIEMaaS project provides SIEM platform and integration services to participating agencies so that they can focus on operationalizing their SIEM and strengthening their security posture. The Elastic SRE will focus on maintaining and optimizing Elastic deployments in Elastic Cloud Hosted (ECH). The Elastic SRE will ensure effective monitoring of cluster health, availability, performance, and cost.
The ideal Elastic SRE candidate must be able to work independently and proactively to find solutions, and to work within a dynamic team structure to achieve program objectives. Primary duties include:
- Monitor and maintain the health, uptime, and availability of Elastic deployments in Elastic Cloud Hosted (ECH) using an Elastic logging/observability cluster, ensuring compliance with service-level agreements (SLAs) and service-level objectives (SLOs).
- Analyze and optimize cluster performance (e.g., indexing, search latency, resource utilization) to meet business and tenant requirements.
- Implement cost optimization strategies (e.g., right-sizing nodes, optimizing storage tiers) to reduce operational costs while maintaining performance and reliability.
- Support Elastic SIEM Engineers in troubleshooting service degradation that impacts SLAs or SLOs.
- Develop and maintain automation scripts and tools (e.g., via ECH APIs, Python) for cluster management and tenant onboarding to reduce manual effort.
- Forecast resource needs and plan cluster scaling within ECH to support growth in data volume and query load, ensuring scalability and resilience.
- Conduct gap analyses for prospective tenants’ Elastic environments to assess health, stability, adherence to Elastic best practices, and optimization opportunities, providing actionable recommendations.
- Collaborate with development, DevOps, and SIEM Engineers to align Elastic configurations with application needs and business objectives.
- Create and maintain comprehensive documentation for cluster configurations and monitoring processes.
The ideal candidate meets the following qualifications:
- Must be a US Citizen and able to acquire DHS Public Trust Suitability.
- Excellent written and verbal communication skills, effective interpersonal skills, strong organizational skills, problem-solving ability, attention to detail, technical documentation skills, and a strong, proactive, self-motivated work ethic.
- Must consistently seek to improve quality and efficiency.
- Must be flexible and thrive in an evolving environment.
- Must be able to proactively detect potential SLA and SLO breaches.
- Minimum of 3 years’ experience deploying, managing, and monitoring Elastic-based tooling (e.g., Elastic Stack, Logstash, Beats, Elastic Agent).
- Minimum of 3 years’ experience managing sharding, replication, and query optimization in Elastic or similar technologies (e.g., Splunk, MongoDB).
- Minimum of 3 years’ experience managing and/or monitoring highly available and fault-tolerant platforms.
- Experience implementing synthetic monitors using Playwright and JavaScript.
- Experience applying disaster recovery best practices and conducting tabletop exercises.
- Experience implementing FinOps practices to optimize cost in line with industry best practices, and an understanding of cost drivers.
- Experience leveraging machine learning to monitor key performance indicators (KPIs) that may impact SLAs and SLOs.
- Experience creating Kibana visualizations, dashboards, and alerts for real-time and historical insights.
- Experience using REST APIs from scripting or programming languages such as Bash and Python.
- Knowledge of Elastic Common Schema or similar (e.g., OTel, CIM, CEF, OCSF).
- Proficiency with Elasticsearch's cross-cluster search (CCS) feature.