
Commit 9362fcd

Support license upload reverts (#1188)
In the event that a state inadvertently uploads invalid data into the system for a compact, we need a way to effectively 'undo' that upload so that license data within the system can be reverted to its state before the upload. This PR includes the following enhancements to the system:

## Move license/privilege update records under a separate sort key to decouple updates from other main records

To reduce the risk that massive invalid updates will crash the system when loading provider data, we have migrated the sort keys of our update records to a tier-based pattern, which allows us to query for update records only as needed. This adds some complexity to the query patterns used to fetch provider records, but most of that complexity is abstracted behind a small number of methods.

## Add licenseUploadDate GSI

We've added a GSI to the provider table to track when uploads occurred for every license update record, and we will begin tracking the first time a license record was uploaded into the system. The PK for this new GSI includes the compact, jurisdiction, and upload month. The SK includes the epoch timestamp of when the record was uploaded/updated, the license type, and the provider id.

```
licenseUploadDateGSIPK = f'C#{in_data["compact"].lower()}#J#{in_data["jurisdiction"].lower()}#D#{YYYY-MM}'
licenseUploadDateGSISK = f'TIME#{upload_epoch_time}#LT#{in_data["licenseType"]}#PID#{in_data["providerId"]}'
```

These **OPTIONAL** GSI fields will be added to all license and license update records moving forward for every ingest event.

## Add Step Function for reverting license uploads

We've added a step function that takes in the compact, jurisdiction, and time window (start date time and end date time). The step function looks up all licenses that were uploaded for that compact/jurisdiction between the timestamps using the new GSI, and reverts each license record to its latest state prior to the rollback time if no other updates were made to that provider since the upload time period. When complete, it generates a JSON report stored in S3 that details which licenses were reverted and which providers require manual review due to detected updates unrelated to the license upload.

These enhancements protect the system from accidental uploads and provide a process for reverting such incidents.

### Requirements List

- A migration script has been added to convert all existing update records to the new sort key pattern. This migration path is **fully** backwards compatible, meaning the system will continue to function before, during, and after the cutover without downtime.

Closes #1175

## Summary by CodeRabbit

* **New Features**
  * Disaster recovery: license-upload rollback workflow with persisted results and revert events.
  * Tiered update model and migration tooling for richer historical queries.
  * License-upload date tracking and index to speed provider queries.
* **Bug Fixes**
  * SSN uniqueness enforcement now scoped per license type in batch uploads.
* **Documentation**
  * Added comprehensive disaster-recovery guides and rollback usage docs.
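As an illustration of the licenseUploadDate GSI key format described above, here is a minimal sketch of how the two key strings might be assembled; the helper name and its inputs are hypothetical, and the `{YYYY-MM}` month bucket is assumed to be derived from the upload time (the actual ingest code may differ):

```python
from datetime import datetime, timezone

def build_license_upload_gsi_keys(in_data: dict, upload_time: datetime) -> tuple[str, str]:
    """Sketch of the licenseUploadDate GSI key construction.
    PK buckets records by compact + jurisdiction + upload month;
    SK orders them by epoch upload time, then license type, then provider id."""
    month_bucket = upload_time.strftime('%Y-%m')  # the {YYYY-MM} portion of the PK
    upload_epoch_time = int(upload_time.timestamp())
    pk = f'C#{in_data["compact"].lower()}#J#{in_data["jurisdiction"].lower()}#D#{month_bucket}'
    sk = f'TIME#{upload_epoch_time}#LT#{in_data["licenseType"]}#PID#{in_data["providerId"]}'
    return pk, sk


# Hypothetical ingest payload, for illustration only
pk, sk = build_license_upload_gsi_keys(
    {'compact': 'ASLP', 'jurisdiction': 'OH', 'licenseType': 'audiologist', 'providerId': 'abc-123'},
    datetime(2025, 1, 15, 9, 0, tzinfo=timezone.utc),
)
# pk → 'C#aslp#J#oh#D#2025-01'
```

Because the SK leads with the epoch timestamp, a revert query can constrain the time window with a simple range condition on the SK within a single month-bucket PK.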
1 parent a82c049 commit 9362fcd

56 files changed

Lines changed: 6329 additions & 695 deletions


backend/compact-connect/common_constructs/user_pool.py

Lines changed: 12 additions & 18 deletions
_(GitHub's rendering dropped leading whitespace, so several removed/added line pairs below differ only in indentation.)_

```diff
@@ -144,10 +144,10 @@ def __init__(  # pylint: disable=too-many-arguments
         )

     def add_custom_app_client_domain(
-        self,
-        hosted_zone: IHostedZone,
-        scope: Construct,
-        app_client_domain_prefix: str,
+        self,
+        hosted_zone: IHostedZone,
+        scope: Construct,
+        app_client_domain_prefix: str,
     ):
         """
         Creates a custom subdomain for the cognito app client in the form of:
@@ -159,17 +159,11 @@ def add_custom_app_client_domain(
         domain_name = f'{domain_prefix}.{hosted_zone.zone_name}'
         cert_id = f'{app_client_domain_prefix}AuthCert'
         cert = Certificate(
-            scope,
-            cert_id,
-            domain_name=domain_name,
-            validation=CertificateValidation.from_dns(hosted_zone=hosted_zone)
+            scope, cert_id, domain_name=domain_name, validation=CertificateValidation.from_dns(hosted_zone=hosted_zone)
         )
         domain = self.add_domain(
             f'{app_client_domain_prefix}UserPoolDomain',
-            custom_domain=CustomDomainOptions(
-                certificate=cert,
-                domain_name=domain_name
-            ),
+            custom_domain=CustomDomainOptions(certificate=cert, domain_name=domain_name),
             managed_login_version=ManagedLoginVersion.NEWER_MANAGED_LOGIN,
         )
@@ -195,7 +189,7 @@ def add_custom_app_client_domain(
                     'id': 'AwsSolutions-IAM5',
                     'appliesTo': ['Resource::*'],
                     'reason': 'This is an AWS-managed custom resource Lambda that requires wildcard permissions'
-                    'to describe CloudFront distributions.',
+                    'to describe CloudFront distributions.',
                 }
             ],
         )
@@ -211,7 +205,7 @@ def add_custom_app_client_domain(
                 'appliesTo': [
                     'Policy::arn:<AWS::Partition>:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole'
                 ],
-                'reason': 'This is an AWS-managed custom resource Lambda that uses the standard execution role.'
+                'reason': 'This is an AWS-managed custom resource Lambda that uses the standard execution role.',
                 }
             ],
         )
@@ -223,21 +217,21 @@ def add_custom_app_client_domain(
             {
                 'id': 'HIPAA.Security-LambdaDLQ',
                 'reason': 'This is an AWS-managed custom resource Lambda used only during deployment.'
-                'A DLQ is not necessary.',
+                'A DLQ is not necessary.',
             },
             {
                 'id': 'HIPAA.Security-LambdaInsideVPC',
                 'reason': 'This is an AWS-managed custom resource Lambda that needs internet access to'
-                'describe CloudFront distributions.',
+                'describe CloudFront distributions.',
             },
         ],
     )

     self.app_client_custom_domain = domain

     def add_default_app_client_domain(
-        self,
-        non_custom_domain_prefix: str,
+        self,
+        non_custom_domain_prefix: str,
     ):
         """
         Creates a cognito based sub domain in the form of:
```
Lines changed: 230 additions & 0 deletions
## Overview

The Full Table Disaster Recovery (DR) system provides automated recovery capabilities for critical DynamoDB tables in the CompactConnect system. It allows administrators to perform Point-in-Time Recovery (PITR) operations when tables become corrupted or require rollback to a previous state.

**⚠️ WARNING: This system performs a HARD RESET of the target table, permanently deleting all current data before restoring from the specified timestamp.**

## When to Use

This Disaster Recovery process should only be run when the system experiences an event that causes system-wide failures, such as the following scenarios:

1. **Data Corruption**: When a table contains corrupted or invalid data that cannot be fixed through normal operations
2. **Accidental Data Loss**: When critical data has been accidentally deleted or modified
3. **Failed Deployments**: When a deployment has caused data integrity issues
4. **Security Incidents**: When unauthorized modifications require rolling back to a clean state
5. **System-wide Issues**: When multiple tables need to be restored to a consistent point in time

## Architecture

### Two-Phase Recovery Process

DynamoDB PITR cannot restore data directly into your production table. Instead, it creates a new table whose data exactly matches the production table at the specified timestamp; you, as the owner of the database, must decide what to do with that point-in-time data. For disaster recovery rollback, we get the data into the production table by performing a 'hard reset': **all current data in the production table is deleted**, then the data from the temporary table is copied into the production table. This process uses the following Step Functions:

1. **RestoreDynamoDbTable Step Function** (Parent)
   - Creates a backup of the current table for post-incident analysis
   - Restores a temporary table from the specified PITR timestamp
   - Invokes the SyncTableData Step Function

2. **SyncTableData Step Function** (Child)
   - **Delete Phase**: Removes all records from the production table
   - **Copy Phase**: Copies all records from the temporary table to the production table

Once this process is complete, the target table will contain the data from the specified point in time.
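The delete-then-copy sync can be sketched with in-memory stand-ins for the two tables. This is illustrative only: the real SyncTableData workflow operates on DynamoDB via paginated scans and batched writes, and the function and variable names here are hypothetical:

```python
def hard_reset_sync(production: dict, restored: dict) -> None:
    """Sketch of the two-phase sync: delete everything in the production
    table, then copy every item from the restored (point-in-time) table.
    Tables are modeled as dicts keyed by (pk, sk); real code would use
    paginated scans and DynamoDB batch writes."""
    # Delete Phase: remove all records from the production table
    for key in list(production.keys()):
        del production[key]
    # Copy Phase: copy all records from the temporary (restored) table
    for key, item in restored.items():
        production[key] = dict(item)


# Hypothetical data, for illustration: production holds corrupted state,
# restored holds the point-in-time snapshot.
prod = {('PROV#1', 'LICENSE'): {'status': 'corrupted'}}
restored = {
    ('PROV#1', 'LICENSE'): {'status': 'active'},
    ('PROV#2', 'LICENSE'): {'status': 'inactive'},
}
hard_reset_sync(prod, restored)
# prod now matches the restored snapshot exactly
```

Note how the delete phase makes the result independent of whatever was in the production table beforehand, which is what makes the hard reset a true rollback rather than a merge.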
### Per-Table Isolation

Each DynamoDB table has its own dedicated pair of Step Functions:

- `DRRestoreDynamoDbTable{TableName}StateMachine`
- `{TableName}DRSyncTableDataStateMachine`

This design allows for:

- **Targeted Recovery**: Restore only the affected table(s)
- **Granular Permissions**: Each Step Function has minimal, table-specific permissions

## Supported Tables

The following tables are configured for disaster recovery:

| Table Name | Step Function Prefix | Purpose | Recovery Notes |
|------------|---------------------|---------|----------------|
| TransactionHistoryTable | `TransactionHistoryTable` | Transaction data from authorize.net | Can be rolled back independently. After DR rollback, run the Transaction History Processing Workflow Step Function for each compact for every day where data was lost to restore all transaction data from Authorize.net accounts. The Transaction History Processing Workflow step functions are idempotent; they can be run multiple times without producing duplicate transaction items in the table. |
| ProviderTable | `ProviderTable` | Provider information and GSIs | **Dependent on SSN table** - Can be rolled back without updating the SSN table, since the SSN table does not depend on the provider table. **⚠️ WARNING**: If the SSN table needs rollback, the provider table will likely need to be restored to the same point in time as the SSN table. Otherwise, new provider IDs may be generated for existing SSNs, causing data inconsistency/orphaned providers that won't receive license updates. After DR rollback, note that the transaction history table holds a list of all privileges purchased as recorded in Authorize.net and can be used as a data source for repopulating any privilege records lost in the rollback. |
| CompactConfigurationTable | `CompactConfigurationTable` | System configuration data | Can be rolled back independently of other tables. Contains configuration set by compact and state admins. Admins may need to reset configurations that were lost as a result of the rollback. |
| DataEventTable | `DataEventTable` | License data events | Used for downstream processing events triggered by the EventBridge event bus. In the event of recovery, many of these events can likely be restored by replaying events placed on the event bus. See https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-archive.html |
| UsersTable | `UsersTable` | Staff user permissions and account data | Can be rolled back independently. Contains staff user permissions and account information. Admins may need to re-invite users or reset permissions that were lost as a result of the rollback. |

> **Note**: The SSN table is excluded due to additional security requirements and will be handled in a future implementation.
## Running the Disaster Recovery Workflow

### Pre-Execution Checklist

1. **Verify Impact**: Confirm which applications/users will be affected
2. **Communication**: Notify stakeholders of the planned recovery
3. **Timestamp Selection**: Determine the UTC timestamp to restore to (must be within 35 days)
4. **Access Verification**: Confirm you have the necessary permissions (currently only AWS account admins can trigger a DR)

### Step 1: Start Recovery Mode

Before executing the DR Step Function, you must throttle all Lambda functions to prevent other data operations from occurring while databases are being rolled back. A script is provided to perform this action:

```bash
# Navigate to the disaster_recovery directory
cd backend/compact-connect/disaster_recovery

# Start recovery mode for the environment (replace "Prod" with your target environment)
python start_recovery_mode.py --environment Prod
```

This will put the system into recovery mode by:

- Setting reserved concurrency to 0 for all environment Lambda functions, so they can't be invoked
- Leaving Disaster Recovery functions operational
- **Important**: If any functions failed to throttle, you may rerun the script or manually check their reserved concurrency settings. The script is idempotent and can be run multiple times.
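The throttling described above corresponds to the Lambda `PutFunctionConcurrency` API (setting reserved concurrency to 0 prevents all invocations). The following is a sketch, not the project's actual script; the client is injected so the idempotent retry-on-failure behavior can be shown without AWS:

```python
def throttle_functions(lambda_client, function_names):
    """Set reserved concurrency to 0 for each function so it cannot be
    invoked. Returns the names that failed, so the (idempotent) operation
    can simply be rerun for any stragglers."""
    failed = []
    for name in function_names:
        try:
            # With real boto3: boto3.client('lambda').put_function_concurrency(
            #     FunctionName=name, ReservedConcurrentExecutions=0)
            lambda_client.put_function_concurrency(
                FunctionName=name, ReservedConcurrentExecutions=0
            )
        except Exception:
            failed.append(name)
    return failed


class FakeLambdaClient:
    """Stand-in for boto3's Lambda client, for illustration only."""
    def __init__(self):
        self.concurrency = {}

    def put_function_concurrency(self, FunctionName, ReservedConcurrentExecutions):
        self.concurrency[FunctionName] = ReservedConcurrentExecutions


client = FakeLambdaClient()
failed = throttle_functions(client, ['ingest-handler', 'query-handler'])
```

Ending recovery mode is the inverse: removing the reserved-concurrency setting restores normal invocation.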
### Step 2: Execute the Disaster Recovery Step Function for Specific Tables

#### Prerequisites

- Identify the exact table name from the DynamoDB console (needed for `tableNameRecoveryConfirmation`)
- Verify the PITR timestamp is correct
- Create a unique incident ID for tracking (see [Execution Request Parameter Details](#execution-request-parameter-details))

When you are ready to perform a rollback, find the Step Function for the specific table you need to roll back (`DRRestoreDynamoDbTable{TableName}StateMachine`) and start an execution with the following input (replace placeholders with your values):

```json
{
  "incidentId": "<YOUR INCIDENT ID HERE>",
  "pitrBackupTime": "<UTC datetime string>",
  "tableNameRecoveryConfirmation": "<TABLE NAME YOU ARE TRYING TO RECOVER>"
}
```

#### Execution Request Parameter Details

- **`incidentId`** (required)
  - Purpose: Unique identifier for tracking this recovery operation
  - Format: String (80 chars or less; alphanumeric and hyphens)
  - Example: `"incident-2025-001"`, `"corruption-fix-20250115"`
  - Used in: Backup names, restored table names, execution tracking

- **`pitrBackupTime`** (required)
  - Purpose: The timestamp to restore the table to
  - Format: UTC datetime string
  - Example: `"2030-01-15T12:39:46Z"`
  - Constraints: Must be within the PITR retention window (35 days)

- **`tableNameRecoveryConfirmation`** (required)
  - Purpose: Security guard rail to prevent accidental execution
  - Format: Exact table name being recovered (you can copy this from the DynamoDB console)
  - Example: `"Prod-PersistentStack-DataEventTable00A96798-C6VX9JVDOYGN"`
  - Validation: Must match the actual destination table name

Example:

```json
{
  "incidentId": "transaction-corruption-20250115",
  "pitrBackupTime": "2025-01-15T09:00:00Z",
  "tableNameRecoveryConfirmation": "Prod-PersistentStack-TransactionHistoryTable00A96798-C6VX9JVDOYGN"
}
```
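The documented constraints on these three parameters can be expressed as a small validation routine. This is a sketch under stated assumptions, not the Step Function's actual validation code; the 35-day figure is DynamoDB's PITR retention window:

```python
import re
from datetime import datetime, timedelta, timezone

PITR_RETENTION_DAYS = 35

def validate_request(payload: dict, actual_table_name: str, now: datetime) -> list[str]:
    """Return a list of validation errors (empty means the request is valid)."""
    errors = []
    # incidentId: 80 chars or less, alphanumeric and hyphens only
    if not re.fullmatch(r'[A-Za-z0-9-]{1,80}', payload.get('incidentId', '')):
        errors.append('incidentId must be 1-80 alphanumeric/hyphen characters')
    # pitrBackupTime: UTC datetime string within the retention window
    try:
        backup_time = datetime.strptime(
            payload.get('pitrBackupTime', ''), '%Y-%m-%dT%H:%M:%SZ'
        ).replace(tzinfo=timezone.utc)
        if backup_time < now - timedelta(days=PITR_RETENTION_DAYS):
            errors.append('pitrBackupTime is outside the 35-day PITR retention window')
    except ValueError:
        errors.append('pitrBackupTime must be a UTC datetime string like 2025-01-15T09:00:00Z')
    # Guard rail: confirmation must exactly match the destination table
    if payload.get('tableNameRecoveryConfirmation') != actual_table_name:
        errors.append('tableNameRecoveryConfirmation does not match the target table')
    return errors


now = datetime(2025, 1, 20, tzinfo=timezone.utc)
good = {
    'incidentId': 'transaction-corruption-20250115',
    'pitrBackupTime': '2025-01-15T09:00:00Z',
    'tableNameRecoveryConfirmation': 'Prod-PersistentStack-TransactionHistoryTable00A96798-C6VX9JVDOYGN',
}
```

Failing the `tableNameRecoveryConfirmation` check is the expected outcome when the wrong state machine is started against a table, which is exactly the accident the guard rail exists to catch.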
#### Running Step Functions from the AWS Console

1. Navigate to Step Functions in the AWS Console
2. Find the appropriate Step Function(s) for the table(s) you need to recover (e.g., `DRRestoreDynamoDbTableTransactionHistoryTableStateMachine`)
3. For each Step Function you need to run, click "Start Execution"
4. Enter the JSON payload in the input field
5. Click "Start Execution" and wait for completion (multiple Step Functions can be run concurrently if you are restoring multiple tables)
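Executions can also be started programmatically via the Step Functions `StartExecution` API rather than the console. A sketch, with a placeholder state machine ARN and a stub client standing in for boto3 so the logic is shown without AWS:

```python
import json

def start_recovery_execution(sfn_client, state_machine_arn: str, payload: dict) -> str:
    """Start one DR Step Function execution and return its execution ARN.
    With real AWS this would use: sfn_client = boto3.client('stepfunctions')."""
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(payload),  # Step Functions input must be a JSON string
    )
    return response['executionArn']


class FakeSfnClient:
    """Stand-in for boto3's Step Functions client, for illustration only."""
    def start_execution(self, stateMachineArn, input):
        return {'executionArn': stateMachineArn + ':execution:example'}


# Placeholder ARN and example payload from the section above
arn = start_recovery_execution(
    FakeSfnClient(),
    'arn:aws:states:us-east-1:123456789012:stateMachine:'
    'DRRestoreDynamoDbTableTransactionHistoryTableStateMachine',
    {
        'incidentId': 'transaction-corruption-20250115',
        'pitrBackupTime': '2025-01-15T09:00:00Z',
        'tableNameRecoveryConfirmation': 'Prod-PersistentStack-TransactionHistoryTable00A96798-C6VX9JVDOYGN',
    },
)
```

Because each table has its own state machine, restoring multiple tables means one `start_execution` call per table; the executions then run concurrently.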
### Step 3: End Recovery Mode

**⚠️ CRITICAL**: Only proceed after ALL recovery Step Functions you have run have completed successfully.

After the DR Step Function completes successfully for each table you need to restore, end recovery mode to restore normal operations:

```bash
# End recovery mode for the environment
python end_recovery_mode.py --environment Prod
```

This will:

- Remove reserved concurrency throttling from all Lambda functions
- Restore normal application operations
- Complete the disaster recovery process
- **Important**: If any functions failed to unthrottle, you may rerun the script or manually check their reserved concurrency settings. The script is idempotent and can be run multiple times.
### Post-Execution

1. **Verify Recovery**: Confirm data integrity and completeness
2. **Application Testing**: Test critical application functions
3. **Documentation**: Update incident documentation with recovery details
4. **Cleanup Review**: Clean up temporary resources after post-incident analysis

### Operational Constraints

- **Data Loss**: All data newer than the PITR timestamp will be permanently lost. The backup snapshot may be restored post-recovery to determine which records can potentially be recovered.
- **Dependencies**: Related tables may need coordinated restoration for consistency.
## Monitoring and Troubleshooting

### Common Issues and Solutions

#### Invalid table name

- **Cause**: `tableNameRecoveryConfirmation` doesn't match the actual table name (this parameter exists to prevent accidental recovery of the wrong database)
- **Solution**: Copy the exact table name from the DynamoDB console

#### Restore timestamp out of range

- **Cause**: PITR timestamp is outside the 35-day retention window
- **Solution**: Choose a more recent timestamp within the retention period
## Complete Table Deletion Recovery (Manual Backup Restoration)

**⚠️ CRITICAL**: This section applies ONLY when a DynamoDB table has been completely deleted and PITR is not available. It requires manual intervention and cannot use the automated Step Functions.

### Recovery Steps

Depending on how the table was deleted, there may be a latest 'snapshot' backup in the DynamoDB console that you can recover from. If that snapshot is not available, the system performs daily backups of our tables and stores them in the AWS Backup service, which you can recover from.

#### Step 1: Locate the Latest Backup

##### Option A: DynamoDB Console

1. Navigate to DynamoDB Console → Backups
2. Find the most recent backup for the deleted table
3. Note the backup name and creation time

##### Option B: AWS Backup Console

1. Navigate to AWS Backup Console → Backup Vaults
2. Find the most recent recovery point for the deleted table
3. **CRITICAL**: Note the "Original table name" from the recovery point details

#### Step 2: Restore Table from Backup

1. **From DynamoDB Console**:
   - Go to DynamoDB → Backups
   - Select the backup → "Restore"
   - **CRITICAL Configuration**:
     - **Table Name**: Must match EXACTLY the original deleted table name
     - **Encryption**: Select "Customer managed key"
     - **KMS Key**: Choose `<environment>-PersistentStack-shared-encryption-key` for non-SSN tables, `ssn-key` for the SSN table
       - Example: `Prod-PersistentStack-shared-encryption-key`
     - **Global Secondary Indexes (GSIs)**: Ensure ALL original GSIs are included in the restore by selecting 'Restore the entire table'
   - Select 'Restore'

2. **From AWS Backup Console**:
   - Navigate to Recovery Points → Select the backup
   - Click "Restore"
   - **CRITICAL Configuration**:
     - **New Table Name**: Use the EXACT "Original table name" from the recovery point
     - **Encryption**: Choose an AWS KMS key → `<environment>-PersistentStack-shared-encryption-key` for non-SSN tables, `ssn-key` for the SSN table
     - **GSIs**: Verify all original GSIs are restored
   - Select 'Restore Backup'

#### Step 3: Verify Restoration

1. **Table Configuration**:
   - ✅ Table name matches exactly (including environment prefix and suffix)
   - ✅ All Global Secondary Indexes are present
   - ✅ Encryption is set to the correct KMS key
   - ✅ Table status is "ACTIVE"

2. **Data Verification**:
   - Spot-check critical records
   - Verify record counts are reasonable
   - Verify application functionality with the restored table
