joesghub/ec2-remediation-system

Netflix and ServiceNow Failing Instance Remediation System

As a ServiceNow Admin and Jr Developer at Netflix, I built a semi-automated incident response system to help the DevOps engineering team quickly remediate failing AWS EC2 instances, protecting streaming quality for millions of viewers. The system combines monitoring, AI-driven guidance, Slack notifications, and one-click remediation to reduce downtime and manual effort.

System Overview

Netflix DevOps teams managing critical AWS EC2 infrastructure faced a persistent challenge: unnoticed EC2 instance failures in the US-East region caused up to 45 minutes of streaming downtime, risking subscriber satisfaction and retention.

The EC2 Remediation System was designed to address this challenge by integrating ServiceNow workflows, AWS monitoring, AI-guided knowledge retrieval, and proactive Slack notifications. With one-click remediation, the system provides:

  • Rapid incident response: Accelerates the detection and remediation of failing EC2 instances.
  • Customer experience protection: Minimizes streaming disruptions to maintain high subscriber satisfaction.
  • Operational efficiency: Reduces repetitive manual work for DevOps engineers.
  • Business resilience: Mitigates potential revenue loss from downtime and infrastructure issues.

Business Outcomes

The EC2 Remediation System delivers measurable impact for Netflix through operational metrics and automation:

  • Time-to-remediate reduction: From up to 45 minutes to 2–6 minutes per incident.
  • Automation coverage: ~70% of incidents handled without manual intervention via workflows and one-click remediation.
  • Knowledge scaling: AI Search integration retrieves relevant KB articles in seconds, reducing reliance on tribal knowledge.
  • Incident visibility: Slack alerts and Flow Designer metrics provide real-time status updates and operational insights.
  • Reliability improvements: Standardized workflows ensure predictable and consistent incident resolution.
  • Customer impact mitigation: Minimizes the number of viewers affected during EC2 failures, protecting Netflix’s brand and subscription base.

Tools and Technologies Used

| Tool / Technology | Purpose / Role | Business Value |
| --- | --- | --- |
| ServiceNow Platform | Custom tables, Scoped App, UI Action, Script Include, Flow Designer, AI Search integration, and system logs. | Centralized, auditable platform that accelerates incident response and supports operational compliance. |
| AWS Integration Server | Feeds real-time EC2 instance data into ServiceNow. | Enables immediate detection of infrastructure issues, reducing downtime and customer impact. |
| Flow Designer | Orchestrates incident creation, AI Search, and Slack notifications. | Streamlines workflows and reduces manual effort, improving mean time to resolution (MTTR). |
| AI Search Custom Action | Retrieves relevant KB articles during incident workflows. | Scales institutional knowledge, giving engineers instant access to remediation guidance. |
| Slack Webhook | Sends real-time alerts and remediation guidance to the DevOps team channel. | Enables faster response and proactive mitigation, protecting service availability. |
| UI Action (trigger_EC2_Remediation.js) | One-click EC2 remediation from the form. | Simplifies recovery steps, reduces errors, and accelerates incident resolution. |
| Script Include (EC2RemediationHelper.js) | Executes API calls from ServiceNow to AWS for remediation. | Automates critical tasks, ensuring consistent and reliable infrastructure recovery. |
| Draw.io | Visualizes system architecture. | Supports stakeholder alignment and easier onboarding of new team members. |
| ServiceNow Update Set | Captures and transports configured artifacts (ec2-remediation-system.xml). | Ensures repeatable, low-risk deployment across environments. |
| Knowledge Base Articles | Documents remediation steps with AI-discoverable keywords. | Preserves institutional knowledge and enables AI-assisted decision-making during incidents. |

Architecture Diagram

Visual representation of the complete workflow

Implementation Steps

Step 1: Application Setup

I used ServiceNow Studio to create and manage my application. It bundles your application components together by scope.

The Studio saved me a lot of time and kept me in control of my application development process.

Creating a scoped application

Step 2: Table Setup

EC2 Instance Table

Tracks the status of our EC2 instances. This table is connected to the Integration Server through an API.

The server sends POST and PUT requests to this table.

EC2 Instance Table

Remediation Log Table

Tracks the status of EC2 instance remediation attempts. This table is connected to the Integration Server through an API.

The server sends POST requests to this table.

Remediation Log Table
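A minimal sketch of how an integration server could shape those requests, assuming it talks to the ServiceNow Table API (POST to `/api/now/table/{table}` to create a record, PUT to `/api/now/table/{table}/{sys_id}` to update one). The instance URL, the table name `x_demo_ec2_instance`, and the field names are hypothetical placeholders, not the names used in this project.

```javascript
// Sketch only: builds a request description for the ServiceNow Table API.
// The table and field names passed in below are hypothetical placeholders.
function buildTableRequest(instanceUrl, table, record, sysId) {
  const base = instanceUrl.replace(/\/+$/, "") + "/api/now/table/" + table;
  return {
    // Create when no sys_id is known yet, otherwise update that record.
    method: sysId ? "PUT" : "POST",
    url: sysId ? base + "/" + sysId : base,
    headers: { "Content-Type": "application/json", Accept: "application/json" },
    body: JSON.stringify(record),
  };
}

const createReq = buildTableRequest(
  "https://example.service-now.com/",
  "x_demo_ec2_instance",
  { instance_id: "i-0123456789abcdef0", instance_status: "OFF" }
);
console.log(createReq.method, createReq.url);
```

The Remediation Log table would only ever take the POST branch, since each remediation attempt is a new row.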

Step 3: AWS Integration Configuration

ServiceNow Connection & Credential Alias

  • Connection Alias: A shortcut name that points to a connection (like an API endpoint).

  • Credential Alias: A shortcut name that points to credentials (like a username, password, or token).

Instead of hardcoding connection or credential details in my integration logic, I can reference the alias. This makes it easy to swap out the actual connection or credentials later without changing the code.

I think of aliases as labels that keep your integrations flexible and reusable.

ServiceNow Connection & Credential Alias

ServiceNow HTTP Connection

Connection: The destination details (a URL or endpoint) that my integration will talk to.

ServiceNow HTTP Connection

ServiceNow Credential Records (Type: Basic Auth)

Credential: The authentication record (username/password or token) used to authenticate a connection.

ServiceNow Credential Records (Type: Basic Auth)
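To make the indirection concrete, here is a small illustration in plain JavaScript. It does not use ServiceNow's alias API; plain objects stand in for the alias, connection, and credential records, and every name and value in it is hypothetical.

```javascript
// Illustration only: plain objects standing in for ServiceNow connection
// and credential records. All names, URLs, and secrets are hypothetical.
const connections = {
  "aws-integration": {
    url: "https://integration.example.com",
    credentialAlias: "aws-integration-cred",
  },
};
const credentials = {
  "aws-integration-cred": { type: "basic", username: "svc_user", password: "example-only" },
};

// Integration logic only knows the alias name, never the endpoint or secret.
function resolveAlias(aliasName) {
  const connection = connections[aliasName];
  if (!connection) {
    throw new Error("No connection registered for alias: " + aliasName);
  }
  const credential = credentials[connection.credentialAlias];
  if (!credential) {
    throw new Error("No credential registered for alias: " + connection.credentialAlias);
  }
  return { url: connection.url, credential: credential };
}

const resolved = resolveAlias("aws-integration");
console.log(resolved.url);
```

Pointing the alias at a different endpoint means editing one entry in `connections`; callers of `resolveAlias` stay unchanged.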

Step 4: UI Action and Script Include Implementation

UI Action Configuration

The trigger_EC2_Remediation client-side function runs from a form, grabs the current record’s sys_id, and calls the EC2RemediationHelper Script Include via GlideAjax.

It alerts the user whether the remediation request succeeded or failed, then reloads the form so they can see updated information in the Remediation Log.

UI Action Configuration
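The client-side pattern described above can be sketched as follows. `GlideAjax`, `addParam`, and `getXMLAnswer` are the standard ServiceNow client API, but the parameter name `sysparm_instance_sys_id` and the shape of the JSON answer are assumptions, and the `FakeGlideAjax` stub exists only so the sketch runs outside ServiceNow. This is not the repository's actual trigger_EC2_Remediation.js.

```javascript
// Sketch only. The Ajax class and UI helpers are injected so the same
// function can run against a stub here and against GlideAjax on a form.
function triggerEC2Remediation(sysId, AjaxClass, ui) {
  const ga = new AjaxClass("EC2RemediationHelper");
  ga.addParam("sysparm_name", "triggerRemediation");
  ga.addParam("sysparm_instance_sys_id", sysId); // hypothetical parameter name
  ga.getXMLAnswer(function (answer) {
    let result;
    try {
      result = JSON.parse(answer);
    } catch (e) {
      result = { success: false, message: "Unreadable response from server" };
    }
    ui.alert(
      result.success
        ? "Remediation requested: " + result.message
        : "Remediation failed: " + result.message
    );
    ui.reload(); // refresh the form so the Remediation Log shows the attempt
  });
}

// Stand-in for GlideAjax that always answers with a successful result.
class FakeGlideAjax {
  constructor(name) {
    this.name = name;
    this.params = {};
  }
  addParam(key, value) {
    this.params[key] = value;
  }
  getXMLAnswer(callback) {
    callback(JSON.stringify({ success: true, message: "restart sent" }));
  }
}

const messages = [];
let reloaded = false;
triggerEC2Remediation("abc123", FakeGlideAjax, {
  alert: (msg) => messages.push(msg),
  reload: () => { reloaded = true; },
});
console.log(messages[0]);
```

On a real form, the injected pieces would likely be `GlideAjax` itself, `g_form.getUniqueValue()` for the sys_id, and the browser's own alert and reload functions.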

Script Include Configuration

The EC2RemediationHelper script exposes a triggerRemediation function that takes an EC2 instance sys_id, looks up the corresponding record, retrieves AWS connection details, and makes a REST API call to restart the instance.

It logs the request, response, and any errors to a remediation log table, then returns a JSON result with success status, messages, and metadata.

Script Include Configuration
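The control flow described above (look up the record, call the remediation endpoint, log the outcome, return JSON) can be sketched in plain JavaScript with the platform pieces injected as functions. The field names, the result shape, and the three helper names (`findInstance`, `sendRestart`, `writeLog`) are hypothetical; in the real Script Include these steps would use ServiceNow server APIs instead.

```javascript
// Sketch only: control flow with injected dependencies so it runs outside
// ServiceNow. Helper names and the result shape are assumptions.
function triggerRemediation(sysId, deps) {
  const result = { success: false, message: "", instanceId: null, httpStatus: null };
  try {
    const instance = deps.findInstance(sysId);
    if (!instance) {
      result.message = "EC2 instance record not found: " + sysId;
      return JSON.stringify(result);
    }
    result.instanceId = instance.instanceId;
    const response = deps.sendRestart(instance.instanceId);
    result.httpStatus = response.status;
    result.success = response.status >= 200 && response.status < 300;
    result.message = result.success ? "Restart request accepted" : "Restart request rejected";
  } catch (err) {
    result.message = "Error: " + err.message;
  } finally {
    // Every attempt is logged, including lookups that fail and thrown errors.
    deps.writeLog(result);
  }
  return JSON.stringify(result);
}

const logRows = [];
const answer = triggerRemediation("abc123", {
  findInstance: (id) => (id === "abc123" ? { instanceId: "i-0123456789abcdef0" } : null),
  sendRestart: () => ({ status: 200 }),
  writeLog: (row) => logRows.push({ ...row }),
});
console.log(answer);
```

Writing the log row in `finally` keeps the Remediation Log complete even when the REST call throws, which matches the "logs the request, response, and any errors" behavior.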

Step 5: Flow Designer Workflow Creation

Single Flow Designer Workflow Overview

Workflow Trigger

Trigger: A record is created or updated on the EC2 Instance table with an Instance Status of OFF.

Flow trigger

Workflow Actions

Action 1: Create an Incident Record

Create an Incident Record

Action 2: AI Search Custom

AI Search Custom

Action 3a: Accessing Flow Variables

Accessing Flow Variables

Action 3b: Defining Flow Variables

Defining Flow Variables

Action 4: Slack EC2 Instance Service Alert

You can see the data pills I used from the flow variables that include hyperlinks.

Slack EC2 Instance Service Alert
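As a rough picture of what such an alert carries, the sketch below builds a JSON body for a Slack incoming webhook, using Slack's `<url|label>` link syntax for the hyperlinks. The field names on the `alert` object, the URLs, and the message layout are hypothetical; the real message is assembled from data pills in Flow Designer.

```javascript
// Sketch only: builds the JSON body for a Slack incoming webhook.
// Field names on the alert object and the URLs are hypothetical.
function buildServiceAlert(alert) {
  const lines = [
    "*EC2 Instance Service Alert*",
    "Instance: <" + alert.instanceUrl + "|" + alert.instanceName + ">",
    "Incident: <" + alert.incidentUrl + "|" + alert.incidentNumber + ">",
    "Status: " + alert.status,
  ];
  return { text: lines.join("\n") };
}

const payload = buildServiceAlert({
  instanceName: "demo-instance",
  instanceUrl: "https://example.service-now.com/instance-record",
  incidentNumber: "INC0010001",
  incidentUrl: "https://example.service-now.com/incident-record",
  status: "OFF",
});
console.log(payload.text);
```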

Action 5 and 6: Do and Wait Until

This step allows the flow to wait for the Trigger EC2 Remediation UI Action to complete.

Do and Wait Until

Trigger Remediation UI Action

Here’s the button on the EC2 Instance record.

Trigger Remediation UI Action

Action 7: Instance Status is On Trigger

When we receive the updated instance status, we can update the related Incident record.

Instance Status is On Trigger

Action 8: Update the Incident Record

Update the Incident Record

Action 9a: Slack EC2 Instance Resolution Alert

Slack EC2 Instance Resolution Alert

Action 9b: Alert Variable Division Function

To enhance the Slack message, I performed two functions on the Resolve Time data pill.

Alert Variable Division Function

Action 9c: Alert Variable Rounding Function

First, I divided the Resolve Time (in seconds) by 60 to convert it to minutes. Then I rounded the result for a cleaner display.

Alert Variable Rounding Function
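The same two-step transform, written as plain JavaScript for reference. `Math.round` stands in for the Flow Designer rounding function here; whether the two round halves identically is an assumption.

```javascript
// Sketch only: convert a resolve time in seconds to whole minutes.
function resolveTimeInMinutes(resolveSeconds) {
  const minutes = resolveSeconds / 60; // step 1: seconds to minutes
  return Math.round(minutes);          // step 2: round for display
}

console.log(resolveTimeInMinutes(272)); // 272 / 60 is about 4.53, so this prints 5
```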

Step 6: Knowledge Base Content

Knowledge Base

I chose the Knowledge Base "Knowledge", which led to some challenges down the line.

I was familiar with Search Applications, Profiles, Sources, and Article Publishing, so I thought I could easily make my article searchable in the AI Search Custom.

However, I quickly realized I needed to better understand how the script controlling the AI Search Custom was engineered.

Knowledge Base

Knowledge Article

I added keywords to my article to improve its quality and test the AI Search capabilities.

Knowledge Article

Step 7: AI Search Integration

Customizing AI Search Custom Code to Include KB Article Links

Once I had a handle on my Slack message output, I wanted the results from the customized AI Search integration script (referred to here as AI Search Custom) to match my earlier formatting.

I reviewed the script handling AI Search Custom logic. The highlighted section categorized Knowledge Base results.

To construct the article URL, I updated the code to build the link dynamically:

article.link = baseUrl + article.table + ".do?sys_id=" + article.sysId

Where:

  • article.table pulls the table name.
  • article.sysId pulls the record sys_id.

I then embedded the link into the article number:

article.number = "<" + article.link + "|" + article.number + ">"

This allowed me to reuse the new article.number variable within the existing code flow.

Customizing AI Search Custom Code to include KB Article Links
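Put together as a standalone function, the two assignments above look roughly like this. The function name `linkifyArticle`, the sample values, and the choice to return a copy rather than mutate the article are my own for illustration; as in the original expression, `baseUrl` is assumed to end with a slash.

```javascript
// Sketch only: builds the record URL and wraps the article number in
// Slack's <url|label> link syntax. Sample values are hypothetical.
function linkifyArticle(article, baseUrl) {
  const link = baseUrl + article.table + ".do?sys_id=" + article.sysId;
  return Object.assign({}, article, {
    link: link,
    number: "<" + link + "|" + article.number + ">",
  });
}

const linked = linkifyArticle(
  { table: "kb_knowledge", sysId: "abc123", number: "KB0010001" },
  "https://example.service-now.com/"
);
console.log(linked.number);
```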

Identifying How Search Application Was Chosen

The AI Search Custom script was well organized, with helpful section headings and comments.

This made debugging easier, but I noticed an issue: my article wasn’t being found, even when I entered the name of an existing Search Application.

Upon checking the logs, I realized the script was defaulting to a fallback search.

Identifying how the Search Application is chosen by AI Search Custom

Confirming Which Search App Was Used

In the input (blue), I entered "Knowledge Portal Search Configuration".

But the output (green) showed the system used "Service Portal Default Search Application" instead.

That’s when it clicked! The script couldn’t find my app and activated the fallback clause:

// Fallback to any search config containing 'Search'
searchConfigGR.addQuery('name', 'CONTAINS', 'Search')

Confirming "Search App Used" by AI Search Custom
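The behavior I observed can be modeled with a small in-memory version of the lookup: try the requested name first, then fall back to any configuration whose name contains "Search". This is a sketch of the pattern, not the script's actual code. The array of configs and the function name are hypothetical, the sketch simply models the first lookup missing without explaining why it missed, and JavaScript's `includes` may not match the platform's CONTAINS matching rules exactly.

```javascript
// Sketch only: exact-name lookup with a "contains 'Search'" fallback,
// modeled on plain objects instead of GlideRecord.
function chooseSearchConfig(configs, requestedName) {
  const exact = configs.find((c) => c.name === requestedName);
  if (exact) {
    return { config: exact, usedFallback: false };
  }
  // Fallback: first config whose name contains "Search", if any.
  const fallback = configs.find((c) => c.name.includes("Search"));
  return { config: fallback || null, usedFallback: true };
}

const configs = [
  { name: "Service Portal Default Search Application" },
  { name: "Another Config" },
];
const choice = chooseSearchConfig(configs, "Knowledge Portal Search Configuration");
console.log(choice.usedFallback, choice.config.name);
```

Returning a `usedFallback` flag alongside the config is one way to make a silent fallback visible in logs or in the Slack message.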

Reviewing "Service Portal Default Search Application"

Since I was already familiar with Search Applications, I modified the Service Portal Default Search Application configuration instead of changing the script.

This approach saved time and reduced bugs in testing.

Reviewing "Service Portal Default Search Application" Configuration

Reviewing "Service Portal Default Search Profile"

The Search Profile showed me the Search Sources tied to the Service Portal Default Search Application.

Reviewing "Service Portal Default Search Profile" Configuration

Updating "Service Portal Knowledge Base Search Source"

I had been advised to move my article into the "IT" Knowledge Base since it was the only one the AI Search Custom could find.

That advice worked at the time, but now I finally understood why:

The Service Portal Default Search Application didn’t include the "Knowledge" KB as a source!

After adding the "Knowledge" KB to the Search Sources, the article became searchable.

Updating "Service Portal Knowledge Base Search Source" Configuration
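A toy model of why the article was invisible before the change: if a search source only lists certain Knowledge Bases, articles from any other KB never reach the results. The function, field names, and sample article numbers are hypothetical and greatly simplify how AI Search indexes content.

```javascript
// Sketch only: a search source that admits articles from listed KBs.
function searchableArticles(articles, searchSourceKnowledgeBases) {
  const allowed = new Set(searchSourceKnowledgeBases);
  return articles.filter((article) => allowed.has(article.knowledgeBase));
}

const articles = [
  { number: "KB0000001", knowledgeBase: "IT" },
  { number: "KB0000002", knowledgeBase: "Knowledge" },
];

const before = searchableArticles(articles, ["IT"]);
const after = searchableArticles(articles, ["IT", "Knowledge"]);
console.log(before.length, after.length);
```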

Verifying Knowledge Article Retrieval

After the update, I previewed the results and confirmed my Knowledge Article was now included in the retrievals.

Verifying KB Article Retrieved by updated "Service Portal Knowledge Base Search Source"

Step 8: Testing and Validation

Final Slack Notifications

Here is the final version of my EC2 Remediation System alerts.

I had a vision and enjoyed bringing it to life!

Final Slack Notifications

Final EC2 Instance Table

After 70 updates, I was able to reliably track our instance status.

Final EC2 Instance Table

Final Remediation Log Table

Final Remediation Log Table

Final Incidents

Final Incidents

Final Slack Logs

Final Slack Logs

Optimization

| Area | Improvement / Feature | Business Value | Business Value Category |
| --- | --- | --- | --- |
| Flow Improvements | Flow Variables | Reduces manual errors and accelerates incident handling. | Operational Efficiency |
| | Do and Wait Until Trigger | Ensures accurate status updates, improving reliability and visibility. | Reliability / Accuracy |
| | Record Resolution | Automatically updates related Incident records, reducing follow-up work for engineers. | Operational Efficiency |
| | Notification Insights | Captures workflow metrics, enabling process improvements and faster response times. | Insights & Continuous Improvement |
| AI Search Custom | Article Linking | Connects incidents to relevant Knowledge Base articles, accelerating problem resolution. | Knowledge Management |
| | Expanding Search Sources | Ensures critical guidance is discoverable, improving mean time to resolution (MTTR) and scaling knowledge sharing. | Knowledge Management |

DevOps Usage

As a ServiceNow Admin and Jr Developer at Netflix, I designed this system for DevOps engineers to rapidly detect and remediate EC2 instance failures:

  • Rapid response: Reduces the time to identify and remediate failing instances from up to 45 minutes to just a few minutes.
  • Consistent remediation: The one-click Trigger EC2 Remediation UI Action makes recovery predictable and reduces manual errors.
  • Real-time visibility: Slack notifications and updated Incident records keep engineers informed of progress and status changes.
  • Operational insights: Metrics from Flow Designer allow engineers to monitor reset times (currently 2–6 minutes) and identify opportunities for standardization.
  • Reduced manual workload: Automation frees engineers to focus on higher-value tasks instead of repetitive incident management.
