diff --git a/trouble-dns-resolution-s3-proxy.md b/trouble-dns-resolution-s3-proxy.md
new file mode 100644
index 0000000..508cdeb
--- /dev/null
+++ b/trouble-dns-resolution-s3-proxy.md
@@ -0,0 +1,177 @@
+# DNS Resolution Issues with S3 Proxy in Private VPC Deployments
+
+## Tags
+
+`dns`, `s3-proxy`, `ecs`, `network`, `private-vpc`, `awsvpc`, `troubleshooting`
+
+## Summary
+
+When deploying Quilt in a private VPC with custom DNS configuration, the S3 proxy service may fail to resolve internal hostnames (including the internal registry and AWS S3 endpoints). This occurs because the s3-proxy container obtains its DNS resolver from `/etc/resolv.conf`, which may not include the AWS-provided DNS server (169.254.169.253 or VPC+2 address) when custom DHCP options are configured.
+
+---
+
+## Symptoms
+
+- **S3 proxy fails to connect to the internal registry**
+  - Error: `could not resolve internal registry hostname`
+  - Downloads from the Quilt catalog fail
+  - Package operations may time out
+
+- **S3 proxy cannot resolve AWS S3 endpoints**
+  - Requests to S3 buckets fail
+  - Error logs show DNS resolution failures in nginx
+
+- **Observable indicators:**
+  - ECS task logs show nginx resolver errors
+  - `502 Bad Gateway` errors in the catalog
+  - Package downloads consistently fail while other Quilt functionality works
+
+- **Common environment:**
+  - Private VPC with custom DHCP options
+  - On-premises DNS servers configured
+  - VPN or Direct Connect to on-premises infrastructure
+  - AWS-provided DNS (169.254.169.253) not included in DHCP options
+
+## Likely Causes
+
+### 1. Custom DHCP Options Excluding AWS DNS
+
+When customers configure custom DHCP option sets for their VPC that specify on-premises DNS servers without including AWS's DNS resolver, ECS tasks running in `awsvpc` network mode will not have access to AWS's DNS.
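As a rough illustration of what "including AWS's DNS resolver" means here, the sketch below derives the two resolver addresses named in the Summary (the link-local `169.254.169.253` and the VPC network address plus two) and checks a nameserver list against them. The function names and sample addresses are hypothetical and not part of Quilt; the VPC CIDR is assumed to be the VPC's primary IPv4 range.

```python
import ipaddress

# Link-local address of the AWS-provided resolver, as cited in this article.
AWS_LINK_LOCAL_DNS = ipaddress.ip_address("169.254.169.253")


def aws_resolver_addresses(vpc_cidr):
    """Return the AWS resolver addresses for a VPC: link-local and network base + 2."""
    network = ipaddress.ip_network(vpc_cidr)
    return {AWS_LINK_LOCAL_DNS, network.network_address + 2}


def includes_aws_dns(nameservers, vpc_cidr):
    """True if any configured nameserver is one of the AWS resolver addresses."""
    expected = aws_resolver_addresses(vpc_cidr)
    return any(ipaddress.ip_address(ns) in expected for ns in nameservers)


if __name__ == "__main__":
    # Hypothetical values: an on-premises-only list versus one that adds VPC+2.
    print(includes_aws_dns(["192.168.10.5", "192.168.10.6"], "10.0.0.0/16"))
    print(includes_aws_dns(["10.0.0.2", "192.168.10.5"], "10.0.0.0/16"))
```

Feeding it the nameservers from a task's `/etc/resolv.conf` and the VPC CIDR gives a quick yes/no on whether this cause applies.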
+
+The Quilt S3 proxy service uses nginx, which reads the first `nameserver` entry from `/etc/resolv.conf` at startup:
+
+```bash
+# From s3-proxy/run-nginx.sh
+nameserver=$(awk '{if ($1 == "nameserver") { print $2; exit;}}' < /etc/resolv.conf)
+```
+
+If this nameserver cannot resolve either of the following:
+- Internal AWS hostnames (e.g., S3 VPC endpoint DNS names)
+- Cloud Map service discovery names (e.g., `registry.${StackName}`)
+
+then the S3 proxy will fail.
+
+### 2. VPC Endpoint Private DNS Not Resolving
+
+Even with an S3 VPC endpoint configured, if the task's DNS resolver cannot reach AWS's DNS infrastructure, private DNS names for the endpoint won't resolve.
+
+### 3. Service Discovery (Cloud Map) DNS Failures
+
+Quilt uses AWS Cloud Map for internal service discovery. The registry service registers as `registry.${AWS::StackName}` in a private DNS namespace. Resolving this name requires access to the Route 53 Resolver (AWS DNS).
+
+## Recommendation
+
+### Immediate Fix: Add AWS DNS to DHCP Options
+
+1. **Plan a new DHCP option set** that includes the AWS-provided DNS resolver alongside your custom DNS servers. List the AWS resolver first, because the startup script shown above uses only the first `nameserver` entry:
+
+   **Option A**: Add `169.254.169.253` (works for EC2 instances)
+
+   **Option B**: Add your VPC's DNS address at `+2` (e.g., `10.0.0.2` for a `10.0.0.0/16` VPC)
+
+2. **Create the new DHCP option set** in the AWS Console or via the CLI (an existing option set cannot be modified after creation):
+
+   ```bash
+   aws ec2 create-dhcp-options \
+     --dhcp-configurations \
+       "Key=domain-name-servers,Values=10.0.0.2,YOUR_CUSTOM_DNS_1,YOUR_CUSTOM_DNS_2"
+   ```
+
+3. **Associate the new DHCP options** with your VPC and restart ECS tasks to pick up the new configuration.
+
+### Workaround: DNS Forwarding
+
+If you cannot change the DHCP options, configure your on-premises DNS servers to forward queries for AWS domains to the AWS DNS resolver:
+
+1. **Forward zones:**
+   - `amazonaws.com`
+   - `aws.amazon.com`
+   - Your Cloud Map namespace (e.g., `your-stack-name`)
+
+2. Configure conditional forwarding to the Route 53 Resolver inbound endpoint.
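To make the forwarding rule concrete, here is a minimal sketch of the suffix match a conditional forwarder applies when deciding which queries go to AWS. The zone list mirrors the forward zones above (`your-stack-name` is still a placeholder), and the function name is hypothetical. Real DNS servers implement this matching themselves; the code only shows which names would be forwarded.

```python
# Zones that should be forwarded to AWS; "your-stack-name" is a placeholder
# for the Cloud Map namespace, exactly as in the list above.
FORWARD_ZONES = ("amazonaws.com", "aws.amazon.com", "your-stack-name")


def matching_forward_zone(hostname, zones=FORWARD_ZONES):
    """Return the most specific zone that contains hostname, or None."""
    labels = hostname.rstrip(".").lower().split(".")
    best = None
    for zone in zones:
        zone_labels = zone.lower().split(".")
        # A zone matches when its labels are a suffix of the hostname's labels.
        if labels[-len(zone_labels):] == zone_labels:
            if best is None or len(zone_labels) > len(best.split(".")):
                best = zone
    return best


if __name__ == "__main__":
    for name in ("s3.us-east-1.amazonaws.com", "registry.your-stack-name",
                 "intranet.example.com"):
        print(name, "->", matching_forward_zone(name))
```

Matching on whole labels rather than raw string suffixes keeps a name such as `notamazonaws.com` from being forwarded by mistake.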
+
+### Future Enhancement Request
+
+The customer has requested the ability to specify custom DNS servers as a CloudFormation parameter. This would involve adding `DnsServers` to the ECS task definitions:
+
+```yaml
+# Example of desired functionality
+Parameters:
+  CustomDnsServers:
+    Type: CommaDelimitedList
+    Default: ""
+    Description: "Custom DNS servers for ECS tasks (optional)"
+```
+
+This enhancement is being tracked internally.
+
+## Debugging Steps
+
+### 1. Verify DNS in the Running Container
+
+If ECS Exec is enabled, connect to the s3-proxy container:
+
+```bash
+aws ecs execute-command \
+  --cluster YOUR_CLUSTER \
+  --task TASK_ID \
+  --container s3-proxy \
+  --command "/bin/sh" \
+  --interactive
+```
+
+Then check:
+
+```bash
+cat /etc/resolv.conf
+nslookup registry.YOUR_STACK_NAME
+nslookup s3.us-east-1.amazonaws.com
+```
+
+### 2. Check CloudWatch Logs
+
+Look for DNS resolution errors in the s3-proxy log group:
+
+```
+/quilt/${StackName}/s3-proxy
+```
+
+Common error patterns:
+- `[error] ... could not be resolved`
+- `upstream timed out`
+- `no resolver defined to resolve`
+
+### 3. Verify VPC DNS Settings
+
+```bash
+aws ec2 describe-vpc-attribute \
+  --vpc-id YOUR_VPC_ID \
+  --attribute enableDnsSupport
+
+aws ec2 describe-vpc-attribute \
+  --vpc-id YOUR_VPC_ID \
+  --attribute enableDnsHostnames
+```
+
+Both should return `true`.
+
+### 4. Check DHCP Options
+
+```bash
+aws ec2 describe-dhcp-options \
+  --dhcp-options-ids $(aws ec2 describe-vpcs --vpc-ids YOUR_VPC_ID \
+    --query 'Vpcs[0].DhcpOptionsId' --output text)
+```
+
+Verify that `domain-name-servers` includes an AWS DNS resolver.
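The check in step 4 can be scripted. The sketch below pulls the `domain-name-servers` values out of the JSON that `describe-dhcp-options` prints, assuming the usual `DhcpOptions` / `DhcpConfigurations` / `Values` layout of that output. The helper name and the sample payload are hypothetical. The marker set recognizes only `AmazonProvidedDNS` and the link-local address, so a VPC+2 address still has to be compared against your VPC CIDR by hand.

```python
import json

# Values that identify the AWS-provided resolver without knowing the VPC CIDR.
AWS_DNS_MARKERS = {"AmazonProvidedDNS", "169.254.169.253"}


def dhcp_name_servers(describe_output):
    """Extract domain-name-servers values, in order, from describe-dhcp-options JSON."""
    servers = []
    for options in json.loads(describe_output).get("DhcpOptions", []):
        for config in options.get("DhcpConfigurations", []):
            if config.get("Key") == "domain-name-servers":
                servers.extend(v["Value"] for v in config.get("Values", []))
    return servers


if __name__ == "__main__":
    # Hypothetical CLI output for an option set listing only on-premises servers.
    sample = json.dumps({
        "DhcpOptions": [{
            "DhcpOptionsId": "dopt-EXAMPLE",
            "DhcpConfigurations": [{
                "Key": "domain-name-servers",
                "Values": [{"Value": "192.168.10.5"}, {"Value": "192.168.10.6"}],
            }],
        }]
    })
    servers = dhcp_name_servers(sample)
    print("configured servers:", servers)
    print("AWS marker present:", any(s in AWS_DNS_MARKERS for s in servers))
```

In practice the input would be the captured output of the command in step 4 rather than the inline sample.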
+
+## Related Issues
+
+- [AWS Documentation: DNS attributes for your VPC](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html)
+- [AWS Documentation: DHCP option sets](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_DHCP_Options.html)
+- [ECS Task Networking with awsvpc mode](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-networking.html)
+
+## See Also
+
+- JSON Encoding Error Masking Underlying Permission Issues (related KB article)
+- Private VPC Deployment Best Practices
diff --git a/trouble-json-encoding-error-hiding-permissions.md b/trouble-json-encoding-error-hiding-permissions.md
new file mode 100644
index 0000000..2ad4a22
--- /dev/null
+++ b/trouble-json-encoding-error-hiding-permissions.md
@@ -0,0 +1,234 @@
+# JSON Encoding Error Masking Underlying Permission Issues
+
+## Tags
+
+`permissions`, `iam`, `s3`, `error-handling`, `debugging`, `s3-proxy`, `troubleshooting`
+
+## Summary
+
+When S3 permission errors (e.g., `AccessDenied`) occur, the error response from AWS is XML-formatted. In some code paths, attempts to parse this as JSON result in a `JSONDecodeError`, which masks the original permission error and makes debugging more difficult.
+
+---
+
+## Symptoms
+
+- **Generic error messages instead of permission errors**
+  - Error: `JSONDecodeError: Expecting value...` or `Invalid JSON`
+  - The underlying `AccessDenied` or `Forbidden` error is not visible
+
+- **Confusing error logs**
+  - Logs show JSON parsing failures
+  - No clear indication that the root cause is a missing IAM permission
+
+- **Operations fail without clear reason**
+  - Package downloads fail
+  - Bucket operations time out or return errors
+  - S3 proxy returns non-descriptive errors
+
+## Likely Causes
+
+### 1. AWS S3 Returns XML Error Responses
+
+AWS S3 returns error responses in XML format:
+
+```xml
+<?xml version="1.0" encoding="UTF-8"?>
+<Error>
+  <Code>AccessDenied</Code>
+  <Message>Access Denied</Message>
+  <RequestId>...</RequestId>
+  <HostId>...</HostId>
+</Error>
+```
+
+Application code that expects JSON fails when it tries to parse this response:
+
+```python
+response_data = json.loads(response.text)  # Raises JSONDecodeError
+```
+
+The `JSONDecodeError` then propagates in place of the S3 error, so the original error code and message are lost in the exception handling.
+
+### 2. Missing IAM Permissions
+
+Common permission issues that trigger this:
+
+- **S3 bucket policy denying access**
+  - VPC endpoint policies too restrictive
+  - Bucket policy not allowing the Quilt IAM roles
+
+- **IAM role missing required permissions**
+  - `s3:GetObject`, `s3:PutObject`, `s3:ListBucket` missing
+  - Cross-account access not configured
+
+- **Resource-based policies conflicting**
+  - KMS key policies not allowing decrypt
+  - SNS/SQS policies not allowing publish/receive
+
+### 3. Error Handling Code Path
+
+The error may occur in:
+
+1. S3 proxy nginx → upstream registry → S3
+2. Lambda functions processing S3 events
+3. Registry API calling AWS services
+
+## Recommendation
+
+### Immediate Debugging Steps
+
+#### 1. Check IAM Permissions
+
+Review the IAM roles used by Quilt services:
+
+| Role | Purpose |
+|------|---------|
+| `T4BucketReadRole` | Read access to managed buckets |
+| `T4BucketWriteRole` | Write access to managed buckets |
+| `PackagerRole` | Package operations |
+| `ManagedUserRole` | User-assumed role for data access |
+
+Verify these roles have the required permissions for your buckets.
+
+#### 2. Test S3 Access Directly
+
+Use the AWS CLI with the Quilt role to test access:
+
+```bash
+# Confirm which role is in use (run inside the ECS task if ECS Exec is enabled)
+aws sts get-caller-identity
+
+# Test bucket access
+aws s3 ls s3://YOUR_BUCKET/
+aws s3 cp s3://YOUR_BUCKET/test-file.txt -
+```
+
+#### 3. Enable S3 Access Logging
+
+Enable S3 server access logging to see the actual error codes returned by S3:
+
+```bash
+aws s3api put-bucket-logging \
+  --bucket YOUR_BUCKET \
+  --bucket-logging-status '{
+    "LoggingEnabled": {
+      "TargetBucket": "YOUR_LOG_BUCKET",
+      "TargetPrefix": "s3-access-logs/"
+    }
+  }'
+```
+
+#### 4. Check CloudTrail
+
+Look for `AccessDenied` events in CloudTrail. `lookup-events` searches management events only, so this query covers bucket-level S3 calls:
+
+```bash
+aws cloudtrail lookup-events \
+  --lookup-attributes AttributeKey=EventSource,AttributeValue=s3.amazonaws.com \
+  --max-items 50
+```
+
+Filter for events with `errorCode: AccessDenied`. Object-level calls such as `GetObject` are data events; they are recorded only if a trail logs S3 data events and must be searched in that trail's logs.
+
+### Common Permission Fixes
+
+#### Bucket Policy for Quilt Roles
+
+Ensure your bucket policy allows the Quilt roles. This example grants read-only access:
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Sid": "AllowQuiltAccess",
+      "Effect": "Allow",
+      "Principal": {
+        "AWS": [
+          "arn:aws:iam::ACCOUNT:role/STACK-T4BucketReadRole-XXXX",
+          "arn:aws:iam::ACCOUNT:role/STACK-T4BucketWriteRole-XXXX"
+        ]
+      },
+      "Action": [
+        "s3:GetObject",
+        "s3:GetObjectVersion",
+        "s3:ListBucket"
+      ],
+      "Resource": [
+        "arn:aws:s3:::YOUR_BUCKET",
+        "arn:aws:s3:::YOUR_BUCKET/*"
+      ]
+    }
+  ]
+}
+```
+
+#### VPC Endpoint Policy
+
+If using S3 VPC endpoints, ensure the endpoint policy allows access:
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Sid": "AllowAll",
+      "Effect": "Allow",
+      "Principal": "*",
+      "Action": "s3:*",
+      "Resource": "*"
+    }
+  ]
+}
+```
+
+This example is fully open; restrict it to specific principals and resources as needed.
+
+#### Cross-Account Access
+
+For cross-account bucket access, both the bucket policy AND the IAM role policy must allow access:
+
+1. **Bucket policy** (in bucket account): Allow the Quilt role ARN
+2. **IAM policy** (in Quilt account): Allow actions on the bucket ARN
+
+### Future Improvement
+
+We are tracking an enhancement to improve error handling so that:
+
+1. Original AWS error messages are preserved and logged
+2. XML error responses from S3 are properly parsed
+3. Clear error messages distinguish between permission errors and other failures
+
+## Debugging Steps
+
+### 1. Enable Debug Logging
+
+Set the `FLASK_DEBUG=1` environment variable in the registry container to get more detailed error messages.
+
+### 2. Check Specific Error Logs
+
+Look in CloudWatch Logs for patterns:
+
+**S3 Proxy logs** (`/quilt/${StackName}/s3-proxy`):
+```
+# Look for upstream errors
+upstream returned...
+proxy_pass...error
+```
+
+**Registry logs** (`/quilt/${StackName}/registry`):
+```
+# Look for boto3/botocore errors
+ClientError
+AccessDenied
+```
+
+### 3. Reproduce with AWS CLI
+
+Identify the exact operation that fails and reproduce it with the AWS CLI using the same credentials.
+
+## Related Issues
+
+- DNS Resolution Issues with S3 Proxy (related KB article)
+- [AWS S3 Error Responses](https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html)
+- [IAM Policy Troubleshooting](https://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoot_policies.html)
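Returning to the XML error responses described under Likely Causes: the sketch below shows one way a caller could keep the S3 error code visible instead of surfacing a bare `JSONDecodeError`. It is a hypothetical helper, not the Quilt registry's actual error handling, and it assumes the error body is available as a string and uses the `Code` and `Message` elements of the S3 error format.

```python
import json
import xml.etree.ElementTree as ET


def describe_error_body(body):
    """Return a dict describing an error body that may be JSON or S3-style XML."""
    try:
        return {"format": "json", "data": json.loads(body)}
    except json.JSONDecodeError:
        pass  # Not JSON; fall through and try the XML error format.
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Neither JSON nor XML: keep the raw text so nothing is lost.
        return {"format": "unknown", "raw": body}
    return {
        "format": "xml",
        "code": root.findtext("Code"),
        "message": root.findtext("Message"),
    }


if __name__ == "__main__":
    # Sample body with the elided fields left out.
    sample = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<Error><Code>AccessDenied</Code>"
        "<Message>Access Denied</Message></Error>"
    )
    print(describe_error_body(sample))
```

Logging the returned `code` next to the failing operation would make a permission problem recognizable at a glance, which is the goal of the Future Improvement items.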