SRE Technical Manager
Company: Leidos
Location: Washington
Posted on: November 11, 2024
Job Description:
More About the Role:
Leidos currently has an opening on the Service Management Integration and Transport (SMIT) Contract for a Site Reliability Engineering (SRE) Technical Manager. This is an exciting opportunity to use your experience and leadership skills to successfully execute the mission of the Navy's largest IT services program. Under the SMIT Contract, the Leidos team is responsible for the core backbone for the Navy-Marine Corps Intranet, including cybersecurity services, network operations, network engineering, service desk, seat support services, and data transport.

We are seeking a highly skilled and experienced SRE Technical Manager to lead our Data Center Site Reliability Engineering (SRE) team. In this role, you will manage a group of talented engineers responsible for ensuring the reliability, performance, and scalability of critical systems across 6-8 SRE Pods. You will work closely with engineering, product, and operations teams to implement best practices in automation, incident management, and system monitoring. This role will focus on both the strategic and operational aspects of site reliability, ensuring that the team meets performance objectives while fostering a culture of innovation and continuous improvement. The SRE Technical Manager will collaborate with the Director of Site Reliability Engineering and is responsible for supporting, migrating, automating, and optimizing software development and deployment processes and infrastructure as code, and for maturing the Site Reliability Engineering program. The manager will mentor and coach lower-level technical staff, performing collaborative code reviews to strengthen SRE skills across the teams.

What You'll Get to Do:
--- Manage and mentor 6-8 SRE teams (pods) and 60+ FTEs,
providing guidance, setting performance expectations, and fostering professional development.
--- Work collaboratively with SRE Resource Managers to staff and maintain engineering resources for your SRE vertical teams' reliability and scalability goals.
--- Responsible for the P&L across the Data Center Services vertical. Manage the SRE team's resources, including budget planning, tool selection, and infrastructure investments to meet reliability and scalability needs.
--- Meet regularly with your team members, and participate in performance reviews, interviews, and development planning.
--- Oversee the reliability, availability, and performance of critical systems by leading the SRE teams within the data center vertical in implementing monitoring, incident response, and performance optimization strategies.
--- Ensure the team adheres to best practices for system reliability, automation, and operational efficiency.
--- Drive continuous improvement initiatives by analyzing performance metrics (e.g., SLOs, MTTR, MTBF) and identifying areas for enhancement.
--- Collaborate with operations, quality, cybersecurity, and other SRE engineering teams to define and enforce Service Level Objectives (SLOs) and manage error budgets.
--- Act as a liaison between the SRE team and other departments to prioritize reliability and operational needs in the product development process.
--- Collaborate with senior leadership to define the SRE strategy, set long-term reliability goals, and ensure alignment with business objectives.
--- Lead efforts to reduce operational toil through automation. Work with the team to build or enhance automation tools that manage infrastructure, monitor systems, and respond to incidents.
--- Oversee the development and adoption of Infrastructure as Code (IaC) tools, CI/CD pipelines, and other automation processes.
--- Ensure that SRE practices align with organizational security policies and compliance requirements.
--- Collaborate with security teams to integrate reliability-focused security practices into the design and operation of systems.
--- Ensure systems meet or exceed agreed-upon service levels by proactively addressing potential issues and working with stakeholders to align on reliability expectations.
--- Work within an SRE team, collaborating with other Developers, Security, and Operations, to continuously deliver products and increase the value stream for the organization and customers.
--- Embrace and champion Agile development processes and the adoption of modern Site Reliability Engineering workflows and practices while providing technical guidance to team members and coworkers on best practices.
--- Stay up to date on the latest Site Reliability Engineering practices and technologies.
--- Strive to provide internal and external customers with world-class customer service.
--- Resolve most conflicts between timeline, budget, and scope independently, but use good judgment to escalate complex or consequential issues to senior management.

You'll Bring These Qualifications:
--- Requires BS degree (or equivalent)
in Cybersecurity, Information Security, IT, Network Engineering, Computer Science, or related field, or Master's with 6+ years of prior relevant experience, with 8-10 years of SRE or DevOps experience and at least 4 years in a leader or manager capacity.
--- US Citizen with DoD Secret Clearance.
--- Minimum of DoD 8570.01 IAT Level II Certification required prior to onboarding and must maintain certification while supporting the SMIT Contract.
--- Must be able to support program execution in classified environments and access SIPRNet from an NMCI location on short notice (local travel).
--- Exceptional written and oral communication skills, including producing technical analysis/reports, presentations, and executive-level briefings with internal and external stakeholders.
--- Ability to review and comprehend requirements and to design solutions that satisfy customer requirements.
--- Ability to work in a highly collaborative, forward-thinking, and innovation-driven environment.
--- Proven experience managing teams responsible for large-scale, distributed systems with high reliability and performance demands.
--- Strong track record of managing incidents, conducting postmortems, and implementing reliability improvements.
--- Experience implementing and managing Agile or DevOps processes, with a focus on continuous improvement, efficiency, and team productivity.
--- Ability to lead teams through strategic initiatives such as reliability maturity assessments, process automation, and tooling selection.
--- Solid understanding of SRE principles, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgeting.
--- Experience with commercial cloud infrastructure deployment environments such as AWS and Azure.
--- Strong knowledge of automation tools, CI/CD pipelines, and Infrastructure as Code (IaC).
--- Experience with Agile and DevSecOps/SRE concepts and best practices.
--- Hands-on experience with Atlassian products (Jira, Confluence, Bitbucket, etc.).
--- Experience creating Jira and/or Azure DevOps workflows, projects, and custom configurations.
--- Solid experience integrating and maintaining third-party CI/CD tools such as Jenkins and GitLab.
--- Experience with automated provisioning and configuration tools like Terraform, CloudFormation, Ansible, or similar technologies.
--- Basic Linux skills supporting Red Hat Enterprise Linux (RHEL).
--- Working knowledge of the Risk Management Framework (RMF) and DISA STIGs.

These Qualifications Would be Nice to Have:
--- Previous work experience providing support to
the NGEN-NMCI program is highly desired.
--- Previous technical people leadership experience of 8 or more FTEs.
--- Experience with microservices architecture and distributed systems.
--- Familiarity with serverless and event-driven architectures.
--- Certification in cloud platforms (e.g., Azure Certified DevOps Engineer).
--- Experience in high-growth environments or managing teams during significant scaling periods.
--- ITILv4 and Agile SAFe certifications or applicable experience.

Original Posting Date: 2024-11-08

While subject to change based on business needs,
Leidos reasonably anticipates that this job requisition will remain open for at least 3 days with an anticipated close date of no earlier than 3 days after the original posting date as listed above.

Pay Range:
$108,550.00 - $196,225.00
The Leidos pay range for this job level is a general guideline only and not a guarantee of compensation or salary. Additional factors considered in extending an offer include (but are not limited to) responsibilities of the job, education, experience, knowledge, skills, and abilities, as well as internal equity, alignment with market data, applicable bargaining agreement (if any), or other law.

#Remote