| 1 | +## Overview |
| 2 | + |
| 3 | +The Full Table Disaster Recovery (DR) system provides automated recovery capabilities for critical DynamoDB tables in the CompactConnect system. This system allows administrators to perform Point-in-Time Recovery (PITR) operations when tables become corrupted or require rollback to a previous state. |
| 4 | + |
| 5 | +**⚠️ WARNING: This system performs a HARD RESET of the target table, permanently deleting all current data before restoring from the specified timestamp.** |
| 6 | + |
| 7 | +## When to Use |
| 8 | + |
| 9 | +Run this Disaster Recovery process only when an incident causes system-wide failures, such as the |
| 10 | +following scenarios: |
| 11 | + |
| 12 | +1. **Data Corruption**: When a table contains corrupted or invalid data that cannot be fixed through normal operations |
| 13 | +2. **Accidental Data Loss**: When critical data has been accidentally deleted or modified |
| 14 | +3. **Failed Deployments**: When a deployment has caused data integrity issues |
| 15 | +4. **Security Incidents**: When unauthorized modifications require rolling back to a clean state |
| 16 | +5. **System-wide Issues**: When multiple tables need to be restored to a consistent point in time |
| 17 | + |
| 18 | +## Architecture |
| 19 | + |
| 20 | +### Two-Phase Recovery Process |
| 21 | +DynamoDB PITR cannot restore data directly into an existing production table. Instead, it creates a new table containing the data exactly as it was in the production table at the specified timestamp, and the table owner must decide what to do with that data. For disaster recovery rollback, this system performs a 'hard reset': **all current data in the production table is deleted**, and the data from the temporary table is then copied into the production table. The process is implemented by the following two Step Functions. |
| 22 | + |
| 23 | +1. **RestoreDynamoDbTable Step Function** (Parent) |
| 24 | + - Creates a backup of the current table for post-incident analysis |
| 25 | + - Restores a temporary table from the specified PITR timestamp |
| 26 | + - Invokes the SyncTableData Step Function |
| 27 | + |
| 28 | +2. **SyncTableData Step Function** (Child) |
| 29 | + - **Delete Phase**: Removes all records from the production table |
| 30 | + - **Copy Phase**: Copies all records from the temporary table to the production table |
| 31 | + |
| 32 | +Once this process is complete, the data in the target table will be restored with the data from the specified point in time. |
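
The hard reset performed by the SyncTableData Step Function can be pictured as two loops. The following is a minimal Python sketch of that idea, not the project's implementation. It assumes `production` and `temporary` behave like boto3 DynamoDB `Table` resources (exposing `scan`, `batch_writer`, and `key_schema`); the function names are hypothetical.

```python
def scan_pages(table):
    """Yield each page of items from a full table scan, following pagination."""
    kwargs = {}
    while True:
        page = table.scan(**kwargs)
        yield page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return
        kwargs["ExclusiveStartKey"] = last_key


def delete_all_items(production):
    """Delete phase: remove every record from the production table."""
    key_names = [entry["AttributeName"] for entry in production.key_schema]
    deleted = 0
    with production.batch_writer() as batch:
        for items in scan_pages(production):
            for item in items:
                # Only the primary key attributes are needed to delete an item.
                batch.delete_item(Key={name: item[name] for name in key_names})
                deleted += 1
    return deleted


def copy_all_items(temporary, production):
    """Copy phase: write every record from the temporary table into production."""
    copied = 0
    with production.batch_writer() as batch:
        for items in scan_pages(temporary):
            for item in items:
                batch.put_item(Item=item)
                copied += 1
    return copied


def hard_reset(production, temporary):
    """Run the delete phase, then the copy phase, and report both counts."""
    deleted = delete_all_items(production)
    copied = copy_all_items(temporary, production)
    return {"deleted": deleted, "copied": copied}
```

The ordering matters: the copy phase starts only after the delete phase has finished, so no record newer than the restore point survives in the production table.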
| 33 | + |
| 34 | +### Per-Table Isolation |
| 35 | + |
| 36 | +Each DynamoDB table has its own dedicated pair of Step Functions: |
| 37 | + |
| 38 | +- `DRRestoreDynamoDbTable{TableName}StateMachine` |
| 39 | +- `{TableName}DRSyncTableDataStateMachine` |
| 40 | + |
| 41 | +This design allows for: |
| 42 | +- **Targeted Recovery**: Restore only the affected table(s) |
| 43 | +- **Granular Permissions**: Each Step Function has minimal, table-specific permissions |
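
As a small illustration of the naming pattern above, the two Step Function names can be derived from a table's prefix. This is a sketch only: the helper name is hypothetical, and deployed resources may carry additional generated suffixes.

```python
def dr_state_machine_names(table_prefix: str) -> dict[str, str]:
    """Return the parent (restore) and child (sync) Step Function names for a table."""
    return {
        "restore": f"DRRestoreDynamoDbTable{table_prefix}StateMachine",
        "sync": f"{table_prefix}DRSyncTableDataStateMachine",
    }


names = dr_state_machine_names("TransactionHistoryTable")
print(names["restore"])
print(names["sync"])
```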
| 44 | + |
| 45 | +## Supported Tables |
| 46 | + |
| 47 | +The following tables are configured for disaster recovery: |
| 48 | + |
| 49 | +| Table Name | Step Function Prefix | Purpose | Recovery Notes | |
| 50 | +|------------|---------------------|---------|----------------| |
| 51 | +| TransactionHistoryTable | `TransactionHistoryTable` | Transaction data from Authorize.net | Can be rolled back independently. After DR rollback, run the Transaction History Processing Workflow Step Function for each compact for every day where data was lost to restore all transaction data from Authorize.net accounts. These Step Functions are idempotent: they can be run multiple times without producing duplicate transaction items in the table. | |
| 52 | +| ProviderTable | `ProviderTable` | Provider information and GSIs | **Dependent on SSN table** - Can be rolled back without updating the SSN table, since the SSN table does not depend on the provider table. **⚠️ WARNING**: If the SSN table needs rollback, the provider table will likely need to be restored to the same point in time as the SSN table. Otherwise, new provider IDs may be generated for existing SSNs, causing data inconsistency and orphaned providers that won't receive license updates. After DR rollback, note that the transaction history table lists all privileges purchased as recorded in Authorize.net, and can be used as a data source for repopulating any privilege records lost as a result of the rollback. | |
| 53 | +| CompactConfigurationTable | `CompactConfigurationTable` | System configuration data | Can be rolled back independently of other tables. Contains configuration set by compact and state admins. Admins may need to reset configurations that were lost as a result of the rollback. | |
| 54 | +| DataEventTable | `DataEventTable` | License data events | Used for downstream processing events triggered by the EventBridge event bus. In the event of recovery, many of these events can likely be restored by replaying events placed on the event bus. See https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-archive.html | |
| 55 | +| UsersTable | `UsersTable` | Staff user permissions and account data | Can be rolled back independently. Contains staff user permissions and account information. Admins may need to re-invite new users or reset permissions that were lost as a result of the rollback. | |
| 56 | + |
| 57 | +> **Note**: The SSN table is excluded due to additional security requirements and will be handled in a future implementation. |
| 58 | + |
| 59 | +## Running the Disaster Recovery Workflow |
| 60 | + |
| 61 | +### Pre-Execution Checklist |
| 62 | + |
| 63 | +1. ✅ **Verify Impact**: Confirm which applications/users will be affected |
| 64 | +2. ✅ **Communication**: Notify stakeholders of the planned recovery |
| 65 | +3. ✅ **Timestamp Selection**: Determine the UTC timestamp to restore to (must be within the 35-day PITR retention window) |
| 66 | +4. ✅ **Access Verification**: Confirm you have necessary permissions (Currently only AWS account admins can trigger a DR) |
| 67 | + |
| 68 | +### Step 1: Start Recovery Mode |
| 69 | + |
| 70 | +Before executing the DR Step Function, you must throttle all Lambda functions so that no other data operations occur while tables are being rolled back. A script is provided to perform this action: |
| 71 | + |
| 72 | +```bash |
| 73 | +# Navigate to the disaster_recovery directory |
| 74 | +cd backend/compact-connect/disaster_recovery |
| 75 | + |
| 76 | +# Start recovery mode for the environment (replace "Prod" with your target environment) |
| 77 | +python start_recovery_mode.py --environment Prod |
| 78 | +``` |
| 79 | + |
| 80 | +This will put the system into recovery mode by: |
| 81 | +- Setting reserved concurrency to 0 for all environment Lambda functions, so they can't be invoked |
| 82 | +- Leaving Disaster Recovery functions operational |
| 83 | +- **Important**: If any functions failed to throttle, you may rerun the script or manually check their reserved concurrency settings if needed. The script is idempotent and can be run multiple times. |
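
Conceptually, throttling comes down to setting each function's reserved concurrency to 0. The sketch below illustrates that idea under stated assumptions and is not the contents of `start_recovery_mode.py`: `lambda_client` is assumed to be a boto3 Lambda client (only its `put_function_concurrency` call is used), and `is_dr_function` is a hypothetical predicate for recognising the Disaster Recovery functions that must stay operational.

```python
def throttle_functions(lambda_client, function_names, is_dr_function):
    """Set reserved concurrency to 0 for every non-DR function.

    Returns the names of functions that could not be throttled so the
    caller can retry them or inspect them manually.
    """
    failed = []
    for name in function_names:
        if is_dr_function(name):
            continue  # leave Disaster Recovery functions operational
        try:
            lambda_client.put_function_concurrency(
                FunctionName=name,
                ReservedConcurrentExecutions=0,
            )
        except Exception:
            # A broad catch keeps the sketch dependency-free; with boto3 the
            # error raised here would typically be botocore's ClientError.
            failed.append(name)
    return failed
```

Because setting the same concurrency value twice has no further effect, a loop of this shape can safely be rerun for any functions reported as failed.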
| 84 | + |
| 85 | +### Step 2: Execute Disaster Recovery Step Function For Specific Tables |
| 86 | +#### Prerequisites |
| 87 | +- Identify the exact table name from the DynamoDB console (needed for `tableNameRecoveryConfirmation`) |
| 88 | +- Verify the PITR timestamp is correct |
| 89 | +- Create a unique incident ID for tracking (see [Execution Request Parameter Details](#execution-request-parameter-details)) |
| 90 | + |
| 91 | +When you are ready to perform a rollback, find the Step Function for the specific table you need to roll back (`DRRestoreDynamoDbTable{TableName}StateMachine`) and start an execution with the following input (replace placeholders with your values): |
| 92 | + |
| 93 | +```json |
| 94 | +{ |
| 95 | + "incidentId": "<YOUR INCIDENT ID HERE>", |
| 96 | + "pitrBackupTime": "<UTC datetime string>", |
| 97 | + "tableNameRecoveryConfirmation": "<TABLE NAME YOU ARE TRYING TO RECOVER>" |
| 98 | +} |
| 99 | +``` |
| 100 | + |
| 101 | +#### Execution Request Parameter Details |
| 102 | + |
| 103 | +- **`incidentId`** (required) |
| 104 | + - Purpose: Unique identifier for tracking this recovery operation |
| 105 | + - Format: String (80 chars or less, allows alphanumeric and hyphens) |
| 106 | + - Example: `"incident-2025-001"`, `"corruption-fix-20250115"` |
| 107 | + - Used in: Backup names, restored table names, execution tracking |
| 108 | + |
| 109 | +- **`pitrBackupTime`** (required) |
| 110 | + - Purpose: The timestamp to restore the table to |
| 111 | + - Format: UTC datetime string |
| 112 | + - Example: `"2030-01-15T12:39:46Z"` |
| 113 | + - Constraints: Must be within the PITR retention window (35 days) |
| 114 | + |
| 115 | +- **`tableNameRecoveryConfirmation`** (required) |
| 116 | + - Purpose: Security guard rail to prevent accidental execution |
| 117 | + - Format: Exact table name being recovered (you can copy this from the DynamoDB console) |
| 118 | + - Example: `"Prod-PersistentStack-DataEventTable00A96798-C6VX9JVDOYGN"` |
| 119 | + - Validation: Must match the actual destination table name |
| 120 | + |
| 121 | +Example: |
| 122 | +```json |
| 123 | +{ |
| 124 | + "incidentId": "transaction-corruption-20250115", |
| 125 | + "pitrBackupTime": "2025-01-15T09:00:00Z", |
| 126 | + "tableNameRecoveryConfirmation": "Prod-PersistentStack-TransactionHistoryTable00A96798-C6VX9JVDOYGN" |
| 127 | +} |
| 128 | +``` |
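
Before starting an execution, it can help to check the input locally against the constraints listed above. The following is a hedged sketch of such a pre-flight check; the function name is hypothetical, and the Step Function's own validation may differ.

```python
import re
from datetime import datetime, timedelta, timezone

# Constraints taken from the parameter descriptions above.
INCIDENT_ID_PATTERN = re.compile(r"[A-Za-z0-9-]{1,80}")
PITR_RETENTION = timedelta(days=35)


def build_dr_input(incident_id, pitr_backup_time, table_name, now=None):
    """Return the execution input dict, or raise ValueError if a check fails."""
    now = now or datetime.now(timezone.utc)

    if not INCIDENT_ID_PATTERN.fullmatch(incident_id):
        raise ValueError("incidentId must be 1-80 alphanumeric characters or hyphens")

    # Parse the UTC timestamp in the format shown in the examples.
    restore_point = datetime.strptime(pitr_backup_time, "%Y-%m-%dT%H:%M:%SZ")
    restore_point = restore_point.replace(tzinfo=timezone.utc)
    if restore_point > now:
        raise ValueError("pitrBackupTime is in the future")
    if now - restore_point > PITR_RETENTION:
        raise ValueError("pitrBackupTime is outside the 35-day PITR retention window")

    if not table_name:
        raise ValueError("tableNameRecoveryConfirmation must not be empty")

    return {
        "incidentId": incident_id,
        "pitrBackupTime": pitr_backup_time,
        "tableNameRecoveryConfirmation": table_name,
    }
```

The optional `now` argument exists only so the check can be exercised with a fixed clock; in normal use it can be omitted.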
| 129 | + |
| 130 | +#### Running Step Functions from AWS Console |
| 131 | + |
| 132 | +1. Navigate to Step Functions in the AWS Console |
| 133 | +2. Find the appropriate Step Function(s) for the table(s) you need to recover (e.g., `DRRestoreDynamoDbTableTransactionHistoryTableStateMachine`) |
| 134 | +3. For each Step Function you need to run, click "Start Execution" |
| 135 | +4. Enter the JSON payload in the input field |
| 136 | +5. Click "Start Execution" and wait for completion (multiple Step Functions can be run concurrently if you are restoring multiple tables) |
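
The console steps above also have a programmatic equivalent. As a sketch only, the helper below (a hypothetical name) assembles the arguments for the Step Functions `StartExecution` API; the state machine ARN must be looked up in your account and is passed in rather than guessed.

```python
import json


def start_execution_kwargs(state_machine_arn, dr_input):
    """Build keyword arguments for the Step Functions StartExecution call."""
    return {
        "stateMachineArn": state_machine_arn,
        # The API expects the execution input as a JSON string.
        "input": json.dumps(dr_input),
    }
```

With boto3 installed and suitable credentials, these arguments could be passed as `boto3.client("stepfunctions").start_execution(**kwargs)`.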
| 137 | + |
| 138 | +### Step 3: End Recovery Mode |
| 139 | + |
| 140 | +**⚠️ CRITICAL**: Only proceed after ALL recovery Step Functions you have run have completed successfully. |
| 141 | + |
| 142 | +After the DR Step Function completes successfully for each table you need to restore, end recovery mode to resume normal operations: |
| 143 | + |
| 144 | +```bash |
| 145 | +# End recovery mode for the environment |
| 146 | +python end_recovery_mode.py --environment Prod |
| 147 | +``` |
| 148 | + |
| 149 | +This will: |
| 150 | +- Remove reserved concurrency throttling from all Lambda functions |
| 151 | +- Restore normal application operations |
| 152 | +- Complete the disaster recovery process |
| 153 | +- **Important**: If any functions failed to unthrottle, you may rerun the script or manually check their reserved concurrency settings if needed. The script is idempotent and can be run multiple times. |
| 154 | + |
| 155 | +### Post-Execution |
| 156 | + |
| 157 | +1. **Verify Recovery**: Confirm data integrity and completeness |
| 158 | +2. **Application Testing**: Test critical application functions |
| 159 | +3. **Documentation**: Update incident documentation with recovery details |
| 160 | +4. **Cleanup Review**: Clean up temporary resources after post-incident analysis. |
| 161 | + |
| 162 | +### Operational Constraints |
| 163 | + |
| 164 | +- **Data Loss**: All data newer than the PITR timestamp will be permanently lost. The backup snapshot may be restored post-recovery to determine which records can potentially be recovered. |
| 165 | +- **Dependencies**: Related tables may need coordinated restoration for consistency. |
| 166 | + |
| 167 | +## Monitoring and Troubleshooting |
| 168 | +### Common Issues and Solutions |
| 169 | + |
| 170 | +#### Invalid table name |
| 171 | +- **Cause**: `tableNameRecoveryConfirmation` doesn't match actual table name (this parameter is used to prevent accidental recovery on a database) |
| 172 | +- **Solution**: Copy exact table name from DynamoDB console |
| 173 | + |
| 174 | +#### Restore timestamp out of range |
| 175 | +- **Cause**: PITR timestamp is outside the 35-day retention window |
| 176 | +- **Solution**: Choose a more recent timestamp within the retention period |
| 177 | + |
| 178 | +## Complete Table Deletion Recovery (Manual Backup Restoration) |
| 179 | + |
| 180 | +**⚠️ CRITICAL**: This section applies ONLY when a DynamoDB table has been completely deleted and PITR is not available. This requires manual intervention and cannot use the automated Step Functions. |
| 181 | + |
| 182 | +### Recovery Steps |
| 183 | +Depending on how the table was deleted, there may be a recent 'snapshot' backup in the DynamoDB console that you can recover from. If that snapshot is not available, you can recover from the daily backups that the system takes of each table and stores in the AWS Backup service. |
| 184 | + |
| 185 | +#### Step 1: Locate the Latest Backup |
| 186 | + |
| 187 | +##### Option A: DynamoDB Console |
| 188 | +1. Navigate to DynamoDB Console → Backups |
| 189 | +2. Find the most recent backup for the deleted table |
| 190 | +3. Note the backup name and creation time |
| 191 | + |
| 192 | +##### Option B: AWS Backup Console |
| 193 | +1. Navigate to AWS Backup Console → Backup Vaults |
| 194 | +2. Find the most recent recovery point for the deleted table |
| 195 | +3. **CRITICAL**: Note the "Original table name" from the recovery point details |
| 196 | + |
| 197 | +#### Step 2: Restore Table from Backup |
| 198 | + |
| 199 | +1. **From DynamoDB Console**: |
| 200 | + - Go to DynamoDB → Backups |
| 201 | + - Select the backup → "Restore" |
| 202 | + - **CRITICAL Configuration**: |
| 203 | + - **Table Name**: Must match EXACTLY the original deleted table name |
| 204 | + - **Encryption**: Select "Customer managed key" |
| 205 | + - **KMS Key**: Choose `<environment>-PersistentStack-shared-encryption-key` for non-SSN tables, `ssn-key` for the SSN table |
| 206 | + - Example: `Prod-PersistentStack-shared-encryption-key` |
| 207 | + - **Global Secondary Indexes (GSIs)**: Ensure ALL original GSIs are included in the restore by selecting 'Restore the entire table' |
| 208 | + - Select 'Restore' |
| 209 | + |
| 210 | +2. **From AWS Backup Console**: |
| 211 | + - Navigate to Recovery Points → Select the backup |
| 212 | + - Click "Restore" |
| 213 | + - **CRITICAL Configuration**: |
| 214 | + - **New Table Name**: Use the EXACT "Original table name" from the recovery point |
| 215 | + - **Encryption**: Choose an AWS KMS key -> `<environment>-PersistentStack-shared-encryption-key` for non-SSN tables, `ssn-key` for the SSN table |
| 216 | + - **GSIs**: Verify all original GSIs are restored |
| 217 | + - Select 'Restore Backup' |
| 218 | + |
| 219 | +#### Step 3: Verify Restoration |
| 220 | + |
| 221 | +1. **Table Configuration**: |
| 222 | + - ✅ Table name matches exactly (including environment prefix and suffix) |
| 223 | + - ✅ All Global Secondary Indexes are present |
| 224 | + - ✅ Encryption is set to the correct KMS key |
| 225 | + - ✅ Table status is "ACTIVE" |
| 226 | + |
| 227 | +2. **Data Verification**: |
| 228 | + - Spot-check critical records |
| 229 | + - Verify record counts are reasonable |
| 230 | + - Verify application functionality with the restored table |
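
The table configuration checks above can also be expressed against the output of DynamoDB's `DescribeTable` API. The sketch below assumes you already have that response as a dict (for example from a boto3 client's `describe_table` call); the function name and the expected values you pass in are hypothetical.

```python
def restoration_problems(describe_table_response, expected_name, expected_gsis, expected_kms_key_arn):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    table = describe_table_response["Table"]
    problems = []

    if table.get("TableName") != expected_name:
        problems.append(f"table name is {table.get('TableName')!r}, expected {expected_name!r}")

    if table.get("TableStatus") != "ACTIVE":
        problems.append(f"table status is {table.get('TableStatus')!r}, expected 'ACTIVE'")

    # Compare the restored GSI names with the ones the original table had.
    actual_gsis = {gsi["IndexName"] for gsi in table.get("GlobalSecondaryIndexes", [])}
    missing = sorted(set(expected_gsis) - actual_gsis)
    if missing:
        problems.append(f"missing GSIs: {', '.join(missing)}")

    kms_key_arn = table.get("SSEDescription", {}).get("KMSMasterKeyArn")
    if kms_key_arn != expected_kms_key_arn:
        problems.append(f"KMS key is {kms_key_arn!r}, expected {expected_kms_key_arn!r}")

    return problems
```

This covers configuration only; the data verification steps still need to be done against the restored records.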