For roles based in London, your contractual place of work will be Stratford. The Stratford site is expected to become operational between November 2025 and March 2026; until then, you will be required to carry out your contractual duties from Vauxhall or another reasonable location on a temporary basis. Please note that, as Stratford will be your contractual place of work, any subsequent move from a temporary location will not entitle you to payments for travel time or costs under the Relocation and Excess Travel Policy.
Job summary
As a senior member of the National Data Exploitation Capability (NDEC), you will lead the technical delivery, reliability and performance of advanced applications and services that underpin critical operational activity. You will drive the development and maintenance of resilient, scalable and high‑performing systems, ensuring they consistently meet operational demand. Working at the forefront of the Agency's data and technology landscape, you will shape engineering approaches, champion automation and observability, and enable teams to deliver secure, robust and dependable services.
Job description
As the Lead Site Reliability Engineer, you will provide strategic and technical leadership for the site reliability engineering (SRE) function within the National Data Exploitation Capability (NDEC). You will lead an Agile, multi‑disciplinary team responsible for designing, implementing and operating the applications and services that support critical analytical and operational outcomes across the NCA. Your remit will include ensuring the reliability, resilience, capacity, availability and performance of these services in line with demanding operational needs.
You will act as NDEC's subject matter expert for all aspects of Site Reliability Engineering, setting technical direction, driving adoption of SRE best practice and providing expert guidance to specialist teams. You will lead efforts to build and maintain stable, secure and scalable systems, using data and observability to anticipate issues, reduce operational toil and improve service performance.
A key part of your role will be championing automation and modern engineering approaches to streamline processes, accelerate delivery and enhance system reliability. You will promote a culture of continuous improvement, using monitoring, performance insights and stakeholder feedback to identify opportunities to optimise systems, strengthen service resilience and improve user experience. You will also work closely with engineers, architects, product managers and operational teams to ensure that services are designed and operated in a reliable, maintainable and cost‑effective way.
Through strong collaboration, effective leadership and a deep understanding of SRE principles, you will play a pivotal role in ensuring NDEC's platforms and services remain robust, responsive and able to support mission‑critical operations in a fast‑moving and complex environment.
Duties and Responsibilities
Delivery - Lead and oversee the end‑to‑end delivery of high‑quality software applications and services, from design and testing through to implementation, operation and ongoing support, ensuring they meet reliability, performance and availability requirements.
Quality Assurance - Ensure all solutions are secure by design, compliant with regulatory, security and architectural standards, and aligned with best practice across engineering, operational and governance domains.
Subject Matter Expertise - Act as the SRE subject matter expert on tooling, technologies and engineering practices, including Infrastructure as Code, CI/CD pipelines, observability tooling and containerisation, ensuring these are applied effectively to improve scalability, resilience and operational efficiency.
Monitoring & Observability - Lead the implementation and continuous improvement of monitoring and observability capabilities. Ensure deployed applications and services are actively monitored, and that availability targets are met through effective alerting, diagnostics and operational insight.
Automation - Drive automation initiatives, establishing processes to identify manual or repetitive tasks, and applying automation to reduce operational effort, improve consistency and enhance service reliability.
Incident Management - Lead the detection, diagnosis and resolution of incidents and problems in collaboration with the Service Manager. Ensure effective incident response processes, rapid escalation, clear communication and timely remediation actions.
Scalability & Capacity Planning - Plan for and manage capacity across services and platforms to ensure systems can scale reliably in response to operational and user demand, mitigating performance or stability risks.
Troubleshooting & Problem Resolution - Lead post‑incident reviews and root cause analysis, directing the implementation of lessons learned and longer‑term improvements to prevent recurrence and strengthen system resilience.
Leadership - Provide strong leadership to the NDEC Site Reliability Engineering team, ensuring teams deliver reliable, scalable and secure services throughout the entire software lifecycle. Mentor and develop junior SREs and foster a culture of collaboration, learning and excellence.
Innovation - Stay up to date with emerging industry trends, technologies and SRE practices. Evaluate new tools and techniques to enhance automation, observability and overall service reliability, and guide their adoption where beneficial.
Communication & Collaboration - Communicate clearly and confidently with senior leaders, translating technical issues, risks and dependencies into clear operational or organisational impacts. Ensure the SRE team collaborates effectively with engineers, architects, specialists and operational stakeholders to maintain high‑quality service delivery.
Person specification
Availability & Capacity Management - Ability to lead teams in the design, deployment, monitoring and support of services to ensure they meet availability, reliability and scalability requirements. Experience planning and managing capacity to ensure systems and services scale effectively in response to operational demand.
Coding, Scripting & Infrastructure as Code - Ability to write, read and maintain Infrastructure as Code solutions (e.g. Terraform) and work confidently with containerisation technologies such as Docker. Experience applying automation, scripting and configuration management to improve repeatability and reduce operational effort.
Modern Development Standards & DevOps Practices - Strong understanding of modern development standards, including the use of CI/CD pipelines (e.g. GitLab) and automated build/deployment processes. Ability to lead others in adopting modern engineering practices, including containerisation best practice, with developing skills or an interest in Kubernetes. Experience delivering and maintaining scalable applications using CI/CD, IaC and virtualisation technologies such as VMware (or equivalent).
Problem & Incident Management - Experience identifying, investigating and resolving root causes of incidents and recurring problems, using data to identify patterns and trends. Ability to collaborate with specialists to determine appropriate resolutions, implement preventative measures and drive continuous improvement.
Systems Design & Integration - Ability to review and assure system designs to ensure appropriate technology choices, efficient use of resources and integration across multiple platforms, including virtualisation environments such as VMware. Experience designing or supporting complex, distributed systems, ensuring they are resilient, scalable and secure.
Technical Leadership & SME Expertise - Ability to anticipate technology trends, advise on future opportunities and set direction for tooling, standards and best practice across the SRE function. Demonstrable experience providing technical leadership and mentorship, supporting skill development and capability growth within the team. Experience leading the delivery and lifecycle management of high‑quality, reliable applications and services.
Cloud Engineering - Experience developing, deploying and supporting cloud‑based applications (preferably Amazon Web Services). Understanding of cloud‑native architectures, operational models and security considerations.
Performance & Service Management - Experience monitoring and managing the performance of applications and services to ensure they meet operational and user‑driven demand. Ability to lead post‑incident reviews, direct improvements and ensure stable, high‑quality service operation.
Communication & Stakeholder Engagement - Ability to communicate complex technical information clearly and confidently, adapting style for senior leadership audiences. Experience escalating risks, translating technical issues into business impacts and ensuring decisions are well understood. Ability to lead collaborative working with engineers, architects and operational stakeholders to maintain service quality.
Behaviours
We'll assess you against these behaviours during the selection process:
- Seeing the Big Picture
- Leadership
- Managing a Quality Service
New entrants to the NCA receive 26 days' annual leave, rising to 31 on completion of 5 years' continuous service, plus 8 bank holidays.
If the qualifying criteria are met, new joiners from UK Police Forces or the UK Intelligence Community (UKIC) will have service with those employers taken into account for continuous service purposes for annual leave entitlement only. This will be up to a maximum of 31 days' leave (including 1 privilege day).
Other benefits include:
- Flexible working, including flexi-time, compressed hours and job sharing (in line with business requirements)
- Family friendly policies, notably above the statutory minimum
- Learning and Development opportunities
- Interest free loans and advances, including season tickets, childcare and rental deposits
- Housing schemes - Key Worker status
- Discounts and Savings with a wide variety of services, including Cycle to Work, Smart Tech schemes, dental insurance, gym discounts and savings on everyday spending, available through the Reward Gateway, Edenred and Blue Light Card schemes
- Staff support groups/networks
- Sports and social activities, including membership to the Civil Service Sports Council (CSSC)
Further information is available on the NCA Website.
Artificial intelligence
Selection process details
CV
Please include your full career history, training, qualifications, key responsibilities, and achievements. Explain any employment gaps in the last two years. Ensure all accreditation dates are accurate.
Details of what is expected within your CV are as follows: please provide a high‑level summary of your relevant career history, highlighting the roles, environments and levels of responsibility that demonstrate your ability to operate effectively in a context comparable to this position and meet the criteria listed in the Person Specification.
- Designing, automating and managing highly reliable and scalable distributed systems.
- Hands‑on leadership in continuous integration and continuous deployment (CI/CD), container orchestration, and modern DevOps and Site Reliability Engineering practices.
- Proven experience in incident response, root cause analysis and leading reliability improvement initiatives to enhance the stability and performance of services.
Longlist
In the event of a high number of applications, we may operate a longlist. Applicants will need to meet the minimum pass mark for the lead criteria.
- Designing, automating and managing highly reliable and scalable distributed systems.
Candidates who do not meet the minimum pass mark for the lead criteria will not progress to having their other criteria assessed. Applications must meet the minimum criteria to be progressed to the assessment stage.
You will receive an acknowledgement once your application is submitted.
We aim to complete the sift and release scores within 10 working days of the closing date of the advert. For high-volume campaigns, this timeframe may be extended.
Scores will be provided but further feedback will not be available at this stage.
NCA Applying and Onboarding
Assessment 1
This assessment will be an interview, which will test you against the criteria listed in the Success Profiles at Assessment section.
Success Profiles at Assessment
- Seeing the Big Picture
- Leadership
- Managing a Quality Service
- Designing, automating and managing highly reliable and scalable distributed systems.
- Hands‑on leadership in continuous integration and continuous deployment (CI/CD), container orchestration, and modern DevOps and Site Reliability Engineering practices.
- Proven experience in incident response, root cause analysis and leading reliability improvement initiatives to enhance the stability and performance of services.
If successful but no role is immediately available, you may be placed on a reserve list for 12 months.
Reserve lists can be used to fill similar role types across the Agency where the assessment criteria are considered a match by the recruitment team and the business area.
In the event of a tie at the assessment stage, available roles will be offered in merit order, with ties resolved in the following sequence:
- Lead criteria (behaviours/technical/experience)
- If still tied, desirable criteria will be assessed (if advertised)
- If still tied, application sift scores will be used
Feedback will only be provided if you attend an interview or assessment.
Security
Medical
Nationality requirements
Working for the Civil Service
We recruit by merit on the basis of fair and open competition, as outlined in the Civil Service Commission's recruitment principles.
Diversity and Inclusion
Contact point for applicants
Job contact :
- Name : central.recruitment@nca.gov.uk
- Email : central.recruitment@nca.gov.uk
- Telephone : central.recruitment@nca.gov.uk
Recruitment team
- Email : central.recruitment@nca.gov.uk
