Netflix and ServiceNow Failing Instance Remediation System

As a ServiceNow Admin and Junior Developer at Netflix, I built a semi-automated incident response system to help the DevOps engineering team quickly remediate failing AWS EC2 instances, protecting streaming quality for millions of viewers. The system combines monitoring, AI-driven guidance, Slack notifications, and one-click remediation to reduce downtime and manual effort.

System Overview

Netflix DevOps teams managing critical AWS EC2 infrastructure faced a persistent challenge: unnoticed EC2 instance failures in the US-East region caused up to 45 minutes of streaming downtime, risking subscriber satisfaction and retention.

The EC2 Remediation System was designed to address this challenge by integrating ServiceNow workflows, AWS monitoring, AI-guided knowledge retrieval, and proactive Slack notifications. With one-click remediation, the system provides:

Rapid incident response: Accelerates the detection and remediation of failing EC2 instances.
Customer experience protection: Minimizes streaming disruptions to maintain high subscriber satisfaction.
Operational efficiency: Reduces repetitive manual work for DevOps engineers.
Business resilience: Mitigates potential revenue loss from downtime and infrastructure issues.

Business Outcomes

The EC2 Remediation System delivers measurable impact for Netflix through operational metrics and automation:

Time-to-remediate reduction: From up to 45 minutes to 2–6 minutes per incident.
Automation coverage: ~70% of incidents handled without manual intervention via workflows and one-click remediation.
Knowledge scaling: AI Search integration retrieves relevant KB articles in seconds, reducing reliance on tribal knowledge.
Incident visibility: Slack alerts and Flow Designer metrics provide real-time status updates and operational insights.
Reliability improvements: Standardized workflows ensure predictable and consistent incident resolution.
Customer impact mitigation: Minimizes the number of viewers affected during EC2 failures, protecting Netflix’s brand and subscription base.

Tools and Technologies Used

| Tool / Technology | Purpose / Role | Business Value |
| --- | --- | --- |
| ServiceNow Platform | Custom tables, Scoped App, UI Action, Script Include, Flow Designer, AI Search integration, and system logs. | Centralized, auditable platform that accelerates incident response and supports operational compliance. |
| AWS Integration Server | Feeds real-time EC2 instance data into ServiceNow. | Enables immediate detection of infrastructure issues, reducing downtime and customer impact. |
| Flow Designer | Orchestrates incident creation, AI Search, and Slack notifications. | Streamlines workflows and reduces manual effort, improving mean time to resolution (MTTR). |
| AI Search Custom Action | Retrieves relevant KB articles during incident workflows. | Scales institutional knowledge, giving engineers instant access to remediation guidance. |
| Slack Webhook | Sends real-time alerts and remediation guidance to the DevOps team channel. | Enables faster response and proactive mitigation, protecting service availability. |
| UI Action (`trigger_EC2_Remediation.js`) | One-click EC2 remediation from the form. | Simplifies recovery steps, reduces errors, and accelerates incident resolution. |
| Script Include (`EC2RemediationHelper.js`) | Executes API calls from ServiceNow to AWS for remediation. | Automates critical tasks, ensuring consistent and reliable infrastructure recovery. |
| Draw.io | Visualizes system architecture. | Supports stakeholder alignment and easier onboarding of new team members. |
| ServiceNow Update Set | Captures and transports configured artifacts (`ec2-remediation-system.xml`). | Ensures repeatable, low-risk deployment across environments. |
| Knowledge Base Articles | Documents remediation steps with AI-discoverable keywords. | Preserves institutional knowledge and enables AI-assisted decision-making during incidents. |

Architecture Diagram

Implementation Steps

Step 1: Application Setup

I used ServiceNow Studio to create and manage the application. Studio bundles an application's components together by scope.

Studio saved me a lot of time and kept me in control of the development process.

Step 2: Table Setup

EC2 Instance Table

Tracks the status of each EC2 instance. This table is connected to the Integration Server through an API.

The Integration Server sends POST and PUT requests to this table.

Remediation Log Table

Tracks the status of each EC2 instance remediation attempt. This table is connected to the Integration Server through an API.

The Integration Server sends POST requests to this table.
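As a rough sketch of the traffic described for both tables, the Integration Server's requests could be built along these lines. It assumes the standard ServiceNow Table API path (`/api/now/table/{table}`); the instance URL, the table names `x_ec2_instance` and `x_remediation_log`, the field names, and the `buildTableRequest` helper are hypothetical placeholders, not the names used in this project.

```javascript
// Hypothetical names: the real instance URL, table names, and fields may differ.
const INSTANCE_URL = "https://example.service-now.com";
const EC2_TABLE = "x_ec2_instance";
const LOG_TABLE = "x_remediation_log";

// Build a request descriptor for the ServiceNow Table API.
// POST creates a record; PUT updates the record identified by sysId.
function buildTableRequest(method, table, body, sysId) {
  if (method === "PUT" && !sysId) {
    throw new Error("PUT requires the sys_id of the record to update");
  }
  const path = "/api/now/table/" + table + (method === "PUT" ? "/" + sysId : "");
  return {
    method: method,
    url: INSTANCE_URL + path,
    headers: { "Content-Type": "application/json", Accept: "application/json" },
    body: JSON.stringify(body),
  };
}

// First sighting of an instance: create the record.
const createInstance = buildTableRequest("POST", EC2_TABLE, {
  instance_id: "i-0123456789abcdef0",
  instance_status: "ON",
});

// Later status change: update the existing record (the sys_id is a placeholder).
const updateInstance = buildTableRequest(
  "PUT",
  EC2_TABLE,
  { instance_status: "OFF" },
  "abc123"
);

// The Remediation Log table only receives POSTs in this design.
const logAttempt = buildTableRequest("POST", LOG_TABLE, {
  instance_id: "i-0123456789abcdef0",
  success: true,
});

console.log(createInstance.method, createInstance.url);
console.log(updateInstance.method, updateInstance.url);
console.log(logAttempt.method, logAttempt.url);
```

The descriptors are plain objects, so they can be handed to whatever HTTP client the Integration Server already uses.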

Step 3: AWS Integration Configuration

ServiceNow Connection & Credential Alias

Connection Alias: A shortcut name that points to a connection (like an API endpoint).
Credential Alias: A shortcut name that points to credentials (like a username, password, or token).

Instead of hardcoding connection or credential details in my integration logic, I can reference the alias. This makes it easy to swap out the actual connection or credentials later without changing the code.

I think of aliases as labels that keep your integrations flexible and reusable.
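The indirection can be illustrated outside ServiceNow with plain JavaScript. This is only a conceptual sketch: the alias names, URLs, credentials, and the `resolveAlias` and `buildRestartRequest` helpers are hypothetical and are not ServiceNow APIs.

```javascript
// Conceptual sketch only: these maps stand in for ServiceNow's alias records.
// All names, URLs, and credentials below are hypothetical placeholders.
const connections = {
  aws_integration: { url: "https://integration.example.com/api" },
};
const credentials = {
  aws_integration_cred: { username: "svc_user", password: "change-me" },
};

// Look up the current target of an alias; fail loudly if it is not defined.
function resolveAlias(registry, alias) {
  const target = registry[alias];
  if (!target) {
    throw new Error("Unknown alias: " + alias);
  }
  return target;
}

// Integration logic references only alias names, never literal URLs or secrets.
function buildRestartRequest(instanceId) {
  const conn = resolveAlias(connections, "aws_integration");
  const cred = resolveAlias(credentials, "aws_integration_cred");
  // Basic Auth header: base64 of "username:password".
  const token = Buffer.from(cred.username + ":" + cred.password).toString("base64");
  return {
    url: conn.url + "/instances/" + instanceId + "/restart",
    headers: { Authorization: "Basic " + token },
  };
}

console.log(buildRestartRequest("i-0123456789abcdef0").url);

// Swapping the endpoint is a data change; buildRestartRequest stays untouched.
connections.aws_integration = { url: "https://integration-v2.example.com/api" };
console.log(buildRestartRequest("i-0123456789abcdef0").url);
```

The second call picks up the new endpoint without any change to the function body, which is the flexibility the aliases are meant to provide.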

ServiceNow HTTP Connection

Connection: The destination details of a URL or endpoint my integration will talk to.

ServiceNow Credential Records (Type: Basic Auth)

Credential: The authentication record (username/password or token) used to authenticate a connection.

Step 4: UI Action and Script Include Implementation

UI Action Configuration

The trigger_EC2_Remediation client-side function runs from a form, grabs the current record’s sys_id, and calls the EC2RemediationHelper Script Include via GlideAjax.

It alerts the user whether the remediation request succeeded or failed, then reloads the form so they can see updated information in the Remediation Log.
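A minimal sketch of that client-side flow is shown below. In ServiceNow, `GlideAjax` and `g_form` are provided by the platform; here they are passed in through an `env` object so the logic can run anywhere. The parameter name `sysparm_instance_sys_id`, the result fields `success` and `message`, and the `env.reload` hook are assumptions, not details taken from the actual `trigger_EC2_Remediation` script.

```javascript
// Turn the raw GlideAjax answer into a result object, tolerating bad input.
function parseAnswer(answer) {
  try {
    const parsed = JSON.parse(answer);
    if (parsed && typeof parsed === "object") {
      return parsed;
    }
  } catch (e) {
    // Fall through to the failure result below.
  }
  return { success: false, message: "Unreadable response from server" };
}

// Sketch of the UI Action: read the record's sys_id, call the Script Include,
// tell the user what happened, then reload the form.
function triggerEC2Remediation(env) {
  const ga = new env.GlideAjax("EC2RemediationHelper");
  ga.addParam("sysparm_name", "triggerRemediation");
  ga.addParam("sysparm_instance_sys_id", env.g_form.getUniqueValue()); // assumed name
  ga.getXMLAnswer(function (answer) {
    const result = parseAnswer(answer);
    const prefix = result.success ? "Remediation requested: " : "Remediation failed: ";
    env.alert(prefix + result.message);
    env.reload(); // so the Remediation Log related list shows the new entry
  });
}

// Stand-in for the platform's GlideAjax so the sketch runs outside ServiceNow.
class FakeGlideAjax {
  constructor(scriptInclude) {
    this.scriptInclude = scriptInclude;
    this.params = {};
  }
  addParam(name, value) {
    this.params[name] = value;
  }
  getXMLAnswer(callback) {
    const sysId = this.params.sysparm_instance_sys_id;
    callback(JSON.stringify({ success: true, message: "restart sent for " + sysId }));
  }
}

const events = [];
triggerEC2Remediation({
  GlideAjax: FakeGlideAjax,
  g_form: { getUniqueValue: () => "abc123" },
  alert: (msg) => events.push("alert: " + msg),
  reload: () => events.push("reload"),
});
console.log(events);
```

Passing the platform objects in keeps the branching (success versus failure message, then reload) testable without a ServiceNow instance.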

Script Include Configuration

The EC2RemediationHelper script exposes a triggerRemediation function that takes an EC2 instance sys_id, looks up the corresponding record, retrieves AWS connection details, and makes a REST API call to restart the instance.

It logs the request, response, and any errors to a remediation log table, then returns a JSON result with success status, messages, and metadata.
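The steps above can be sketched with the platform pieces injected as plain functions. In ServiceNow these would be backed by record lookups and an outbound REST call; every dependency name (`findInstance`, `getConnection`, `sendRestart`, `writeLog`) and every result field here is hypothetical, so treat this as an outline of the control flow rather than the contents of `EC2RemediationHelper`.

```javascript
// Sketch of the server-side flow: look up the record, call the remediation
// endpoint, log the attempt, and return a JSON string describing the outcome.
function triggerRemediation(instanceSysId, deps) {
  const result = { success: false, message: "", instanceId: null, httpStatus: null };
  try {
    const instance = deps.findInstance(instanceSysId);
    if (!instance) {
      throw new Error("EC2 instance record not found: " + instanceSysId);
    }
    result.instanceId = instance.instanceId;

    const response = deps.sendRestart(deps.getConnection(), instance.instanceId);
    result.httpStatus = response.status;
    result.success = response.status >= 200 && response.status < 300;
    result.message = result.success
      ? "Restart request accepted"
      : "Restart request rejected";
  } catch (err) {
    result.message = "Remediation error: " + err.message;
  }
  // Every attempt is logged, whether it succeeded, was rejected, or threw.
  deps.writeLog(result);
  return JSON.stringify(result);
}

// Stand-ins so the sketch runs outside ServiceNow (all values are placeholders).
const logTable = [];
const deps = {
  findInstance: (sysId) =>
    sysId === "abc123" ? { instanceId: "i-0123456789abcdef0" } : null,
  getConnection: () => ({ url: "https://integration.example.com/api" }),
  sendRestart: (conn, instanceId) => ({ status: 200, body: "ok" }),
  writeLog: (entry) => logTable.push(Object.assign({}, entry)),
};

console.log(triggerRemediation("abc123", deps));
console.log(triggerRemediation("missing", deps));
console.log(logTable.length);
```

Writing the log entry outside the `try` block means a failed lookup or a rejected request still leaves a row behind, which matches the goal of tracking every remediation attempt.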

Step 5: Flow Designer Workflow Creation

Single Flow Designer Workflow Overview

Workflow Trigger

Trigger: Record created or updated on the EC2 Instance table with an OFF Instance Status

Workflow Actions

Action 1: Create an Incident Record

Action 2: AI Search Custom

Action 3a: Accessing Flow Variables

Action 3b: Defining Flow Variables

Action 4: Slack EC2 Instance Service Alert

The alert uses data pills from the flow variables, including ones that contain hyperlinks.

Actions 5 and 6: Do and Wait Until

This step allows the flow to wait for the Trigger EC2 Remediation UI Action to complete.

Trigger Remediation UI Action

Here’s the button on the EC2 Instance record.

Action 7: Instance Status is On Trigger

When we receive the updated instance status, we can update the related Incident record.

Action 8: Update the Incident Record

Action 9a: Slack EC2 Instance Resolution Alert

Action 9b: Alert Variable Division Function

To enhance the Slack message, I performed two functions on the Resolve Time data pill.

Action 9c: Alert Variable Rounding Function

First, I divided the Resolve Time (in seconds) by 60 to convert it to minutes. Then I rounded the result for a cleaner display.
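The same two-step arithmetic, written out in JavaScript for reference. This assumes round-to-nearest behavior; the rounding transform in Flow Designer may be configured differently, and the function name and sample value are illustrative only.

```javascript
// Convert a resolve time in seconds to whole minutes for display.
function resolveTimeInMinutes(seconds) {
  const minutes = seconds / 60;   // step 1: seconds to minutes
  return Math.round(minutes);     // step 2: round for a cleaner display
}

// 215 seconds is about 3.58 minutes, which rounds to 4.
console.log(resolveTimeInMinutes(215));
```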

Step 6: Knowledge Base Content

Knowledge Base

I chose the Knowledge Base "Knowledge", which led to some challenges down the line.

I was familiar with Search Applications, Profiles, Sources, and Article Publishing, so I thought I could easily make my article searchable through the AI Search Custom action.

However, I quickly realized I needed to better understand how the script behind the AI Search Custom action was written.

Knowledge Article

I added keywords to my article to improve its quality and test the AI Search capabilities.

Step 7: AI Search Integration

Customizing AI Search Custom Code to Include KB Article Links

Once the Slack message output looked the way I wanted, I wanted the results from the customized AI Search integration script (referred to here as AI Search Custom) to match that formatting.

I reviewed the script handling AI Search Custom logic. The highlighted section categorized Knowledge Base results.

To construct the article URL, I updated the code to build the link dynamically:

article.link = baseUrl + article.table + ".do?sys_id=" + article.sysId

Where:

article.table pulls the table name.
article.sysId pulls the record sys_id.

I then embedded the link into the article number:

article.number = "<" + article.link + "|" + article.number + ">"

This allowed me to reuse the new article.number variable within the existing code flow.
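Put together, the two assignments behave like the small function below. The `<url|text>` form is Slack's link syntax for message text; the `baseUrl` value, the sample article, and the `linkArticle` helper are hypothetical placeholders for illustration.

```javascript
// baseUrl and the sample article values are hypothetical placeholders.
const baseUrl = "https://example.service-now.com/";

// Return a copy of the article with a record link and a Slack-formatted number.
function linkArticle(article) {
  const link = baseUrl + article.table + ".do?sys_id=" + article.sysId;
  return Object.assign({}, article, {
    link: link,
    // Slack renders <url|text> as a clickable "text" pointing at "url".
    number: "<" + link + "|" + article.number + ">",
  });
}

const linked = linkArticle({
  table: "kb_knowledge",
  sysId: "abc123",
  number: "KB0010001",
});

console.log(linked.number);
// <https://example.service-now.com/kb_knowledge.do?sys_id=abc123|KB0010001>
```

Because the formatted value replaces `number`, any later code that prints the article number emits the clickable link without further changes.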

Identifying How Search Application Was Chosen

The AI Search Custom script was well organized, with helpful section headings and comments.

This made debugging easier, but I noticed an issue: my article wasn’t being found, even when I entered the name of an existing Search Application.

Upon checking the logs, I realized the script was defaulting to a fallback search.

Confirming Which Search App Was Used

In the input (blue), I entered "Knowledge Portal Search Configuration".

But the output (green) showed the system used "Service Portal Default Search Application" instead.

That’s when it clicked! The script couldn’t find my app and activated the fallback clause:

// Fallback to any search config containing 'Search'
searchConfigGR.addQuery('name', 'CONTAINS', 'Search')
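The fallback behavior can be reproduced with a plain array standing in for the search configuration records. This sketch assumes the requested name matches none of the records the script queries (the source does not say why the lookup missed); the `pickSearchConfig` helper, the `usedFallback` flag, and the second sample record are hypothetical.

```javascript
// Stand-in for the search configuration records the script queries.
const searchConfigs = [
  { name: "Service Portal Default Search Application" },
  { name: "Example Catalog Search" }, // hypothetical extra record
];

// Try an exact name match first; otherwise fall back to the first
// configuration whose name contains "Search".
function pickSearchConfig(configs, requestedName) {
  const exact = configs.find((c) => c.name === requestedName);
  if (exact) {
    return { config: exact, usedFallback: false };
  }
  const fallback = configs.find((c) => c.name.includes("Search"));
  return { config: fallback || null, usedFallback: true };
}

const picked = pickSearchConfig(searchConfigs, "Knowledge Portal Search Configuration");
console.log(picked.config.name, picked.usedFallback);
```

Surfacing something like `usedFallback` in the action's output or logs would make a silent substitution of this kind easier to spot.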

Reviewing "Service Portal Default Search Application"

Since I was already familiar with Search Applications, I modified the Service Portal Default Search Application configuration instead of changing the script.

This approach saved time and reduced bugs in testing.

Reviewing "Service Portal Default Search Profile"

The Search Profile showed me the Search Sources tied to the Service Portal Default Search Application.

Updating "Service Portal Knowledge Base Search Source"

I had been advised to move my article into the "IT" Knowledge Base since it was the only one the AI Search Custom could find.

That advice worked at the time, but now I finally understood why:

The Service Portal Default Search Application didn’t include the "Knowledge" KB as a source!

After adding the "Knowledge" KB to the Search Sources, the article became searchable.

Verifying Knowledge Article Retrieval

After the update, I previewed the results and confirmed my Knowledge Article was now included in the retrievals.

Step 8: Testing and Validation

Final Slack Notifications

Here is the final version of my EC2 Remediation System alerts.

I had a vision and enjoyed bringing it to life!

Final EC2 Instance Table

After 70 updates, I was able to reliably track our instance status.

Final Remediation Log Table

Final Incidents

Final Slack Logs

Optimization

| Area | Improvement / Feature | Business Value | Business Value Category |
| --- | --- | --- | --- |
| Flow Improvements | Flow Variables | Reduces manual errors and accelerates incident handling. | Operational Efficiency |
| | Do and Wait Until Trigger | Ensures accurate status updates, improving reliability and visibility. | Reliability / Accuracy |
| | Record Resolution | Automatically updates related Incident records, reducing follow-up work for engineers. | Operational Efficiency |
| | Notification Insights | Captures workflow metrics, enabling process improvements and faster response times. | Insights & Continuous Improvement |
| AI Search Custom | Article Linking | Connects incidents to relevant Knowledge Base articles, accelerating problem resolution. | Knowledge Management |
| | Expanding Search Sources | Ensures critical guidance is discoverable, improving MTTR (mean time to resolution) and scaling knowledge sharing. | Knowledge Management |

DevOps Usage

As a ServiceNow Admin and Jr Developer at Netflix, I designed this system for DevOps engineers to rapidly detect and remediate EC2 instance failures:

Rapid response: Reduces the time to identify and remediate failing instances from up to 45 minutes to just a few minutes.
Consistent remediation: One-click Trigger EC2 Remediation UI Action ensures predictable, error-free recovery.
Real-time visibility: Slack notifications and updated Incident records keep engineers informed of progress and status changes.
Operational insights: Metrics from Flow Designer allow engineers to monitor reset times (currently 2–6 minutes) and identify opportunities for standardization.
Reduced manual workload: Automation frees engineers to focus on higher-value tasks instead of repetitive incident management.
