JSON Schema Specification
The .dbsurveyor.json format is the standard output format for DBSurveyor schema collection. This specification documents the complete structure, validation rules, and usage examples for the schema format.
Overview
The .dbsurveyor.json format provides a comprehensive, validated representation of database schemas with the following characteristics:
- Security-First: No credential fields or sensitive data allowed
- Validation-Complete: Full JSON Schema validation ensures data integrity
- Database-Agnostic: Unified representation across PostgreSQL, MySQL, SQLite, and MongoDB
- Version-Aware: Format versioning for backward compatibility
- Frictionless-Compatible: Based on Frictionless Data Table Schema specification
Schema Structure
Root Object
Every .dbsurveyor.json file is organized around these top-level fields:
{
"format_version": "1.0",
"database_info": { /* Database metadata */ },
"tables": [ /* Table definitions */ ],
"collection_metadata": { /* Collection process info */ }
}
Required Fields
| Field | Type | Description |
|---|---|---|
| format_version | String | Schema format version (currently "1.0") |
| database_info | Object | Database-level information and status |
| collection_metadata | Object | Collection process metadata |
Optional Fields
| Field | Type | Description |
|---|---|---|
| tables | Array | Table definitions (default: empty array) |
| views | Array | View definitions |
| indexes | Array | Database indexes |
| constraints | Array | Database constraints |
| procedures | Array | Stored procedures |
| functions | Array | Database functions |
| triggers | Array | Database triggers |
| custom_types | Array | Custom data types |
| samples | Array | Data samples from tables |
Database Information
The database_info object contains essential database metadata:
{
"name": "production_db",
"version": "13.7",
"size_bytes": 1073741824,
"encoding": "UTF8",
"collation": "en_US.UTF-8",
"owner": "dbadmin",
"is_system_database": false,
"access_level": "Full",
"collection_status": "Success"
}
Access Levels
- Full: Complete schema access with all metadata
- Limited: Partial access due to permission constraints
- None: No access to schema information
Collection Status
"Success": Schema collected successfully{"Failed": {"error": "Permission denied"}}: Collection failed with reason{"Skipped": {"reason": "System database"}}: Database skipped with explanation
Table Structure
Tables are defined with comprehensive metadata:
{
"name": "users",
"schema": "public",
"columns": [
{
"name": "id",
"data_type": {
"Integer": {
"bits": 32,
"signed": true
}
},
"is_nullable": false,
"is_primary_key": true,
"is_auto_increment": true,
"ordinal_position": 1,
"comment": "Unique user identifier"
},
{
"name": "email",
"data_type": {
"String": {
"max_length": 255
}
},
"is_nullable": false,
"ordinal_position": 2,
"comment": "User email address"
}
],
"primary_key": {
"name": "users_pkey",
"columns": [
"id"
]
},
"foreign_keys": [
{
"name": "users_profile_fk",
"columns": [
"profile_id"
],
"referenced_table": "profiles",
"referenced_columns": [
"id"
],
"on_delete": "Cascade",
"on_update": "Cascade"
}
],
"indexes": [
{
"name": "users_email_idx",
"columns": [
{
"name": "email",
"order": "asc"
}
],
"is_unique": true,
"index_type": "btree"
}
],
"comment": "User account information",
"row_count": 50000
}
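Because the structure is plain JSON, table metadata can be queried with standard tooling. An illustrative jq one-liner that lists each table with its primary-key columns (schema.json is a placeholder; tables without a primary_key fall back to an empty list):
# List each table with its primary-key columns
jq -r '.tables[] | "\(.schema // "-").\(.name): pk=\(.primary_key.columns // [] | join(","))"' schema.json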
Data Type System
DBSurveyor uses a unified data type system that maps database-specific types to a common representation:
Basic Types
"Boolean" // Boolean values
"Date" // Date without time
"Json" // JSON data
"Uuid" // UUID/GUID values
String Types
{
"String": {
"max_length": 255 // Maximum character length
}
}
Numeric Types
{
"Integer": {
"bits": 32, // Bit size: 8, 16, 32, 64, 128
"signed": true // Signed or unsigned
}
}
{
"Float": {
"precision": 53 // Floating point precision (1-53)
}
}
Date/Time Types
{
"DateTime": {
"with_timezone": true // Includes timezone information
}
}
{
"Time": {
"with_timezone": false // Time without timezone
}
}
Complex Types
{
"Array": {
"element_type": "String" // Array element type
}
}
{
"Binary": {
"max_length": 1024 // Maximum binary length
}
}
{
"Custom": {
"type_name": "geometry" // Database-specific custom type
}
}
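Since every column carries a data_type in this unified form, a type inventory can be produced without any database-specific logic. An illustrative jq sketch (file name assumed):
# Print every column with its unified data type
jq -r '.tables[] | .name as $t | .columns[] | "\($t).\(.name) -> \(.data_type | tojson)"' schema.json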
Constraints and Relationships
Primary Keys
{
"name": "users_pkey",
"columns": [
"id"
]
}
Foreign Keys
{
"name": "orders_user_fk",
"columns": [
"user_id"
],
"referenced_table": "users",
"referenced_schema": "public",
"referenced_columns": [
"id"
],
"on_delete": "Cascade",
"on_update": "Restrict"
}
Referential Actions:
- Cascade: Delete/update cascades to related records
- SetNull: Set foreign key to NULL
- SetDefault: Set foreign key to default value
- Restrict: Prevent deletion/update if references exist
- NoAction: No automatic action
Check Constraints
{
"name": "users_age_check",
"constraint_type": "Check",
"definition": "age >= 0 AND age <= 150",
"enforced": true
}
Indexes
Index definitions include performance characteristics:
{
"name": "users_email_idx",
"table_name": "users",
"schema": "public",
"columns": [
{
"name": "email",
"order": "asc",
"nulls_order": "last"
}
],
"is_unique": true,
"is_primary": false,
"index_type": "btree",
"comment": "Unique index on email for fast lookups"
}
Index Types:
- btree: Balanced tree (default)
- hash: Hash-based index
- gin: Generalized inverted index
- gist: Generalized search tree
- spgist: Space-partitioned GiST
Data Sampling
Optional data samples provide insight into actual data:
{
"samples": [
{
"table_name": "users",
"schema_name": "public",
"rows": [
{
"id": 1001,
"email": "user1001@example.com",
"created_at": "2024-01-15T09:00:00Z"
},
{
"id": 1002,
"email": "user1002@example.com",
"created_at": "2024-01-15T09:15:00Z"
}
],
"sample_size": 2,
"total_rows": 50000,
"sampling_strategy": {
"MostRecent": {
"limit": 10
}
},
"collected_at": "2024-01-15T10:30:00Z",
"warnings": [
"Large table - limited sample collected"
],
"sample_status": "Complete"
}
]
}
Table Sample Fields
| Field | Type | Required | Description |
|---|---|---|---|
| table_name | String | Yes | Name of the sampled table |
| schema_name | String | No | Schema/database name (null for databases without schemas) |
| rows | Array | Yes | Array of sampled row data as JSON objects |
| sample_size | Integer | Yes | Number of rows actually sampled |
| total_rows | Integer | No | Estimated total row count in the table |
| sampling_strategy | Object/String | Yes | Strategy used for sampling (see below) |
| collected_at | String | Yes | ISO 8601 timestamp of when the sample was collected |
| warnings | Array | Yes | Array of warning messages (empty if no warnings) |
| sample_status | String/Object | No | Status of the sampling operation (see below) |
Sampling Strategies
{"MostRecent": {"limit": 10}}: Latest N rows{"Random": {"limit": 100}}: Random sample of N rows"None": No sampling performed
Sample Status
The optional sample_status field tracks the outcome of the sampling operation. This field is backward-compatible and will be omitted when not set.
Status Values:
"Complete": Sampling completed successfully{"PartialRetry": {"original_limit": 100}}: Sampling partially completed with a reduced limit due to constraints or errors{"Skipped": {"reason": "Not implemented for this database"}}: Sampling was skipped with an explanation
Example with Complete Status:
{
"table_name": "orders",
"schema_name": "public",
"rows": [...],
"sample_size": 100,
"sampling_strategy": {"MostRecent": {"limit": 100}},
"collected_at": "2024-01-15T10:30:00Z",
"warnings": [],
"sample_status": "Complete"
}
Example with Skipped Status:
{
"table_name": "large_archive",
"schema_name": null,
"rows": [],
"sample_size": 0,
"sampling_strategy": "None",
"collected_at": "2024-01-15T10:30:00Z",
"warnings": [],
"sample_status": {
"Skipped": {
"reason": "Table exceeds size threshold"
}
}
}
Backward Compatibility Note: The sample_status field is optional and omitted when not explicitly set. Older schema files without this field remain fully compatible.
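Consumers should therefore treat the field as absent-by-default. One illustrative jq pattern substitutes a placeholder when sample_status is missing, which works for both old and new files:
# jq's // operator supplies a fallback when sample_status is absent
jq '.samples[]? | {table: .table_name, status: (.sample_status // "unset")}' schema.json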
Multi-Database Collections
For server-level collections, the format supports multiple databases:
{
"format_version": "1.0",
"server_info": {
"server_type": "PostgreSQL",
"version": "13.7",
"host": "localhost",
"port": 5432,
"total_databases": 5,
"collected_databases": 3,
"system_databases_excluded": 2,
"connection_user": "dbadmin",
"has_superuser_privileges": true,
"collection_mode": {
"MultiDatabase": {
"discovered": 5,
"collected": 3,
"failed": 0
}
}
},
"databases": [
/* Individual database schemas */
],
"collection_metadata": {
"collected_at": "2024-01-15T10:30:00Z",
"collection_duration_ms": 2500,
"collector_version": "1.0.0",
"warnings": []
}
}
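A server-level file can be summarized from server_info before drilling into individual databases. A small jq sketch against the fields shown above (server_schema.json matches the multi-database usage example below):
# Summarize a multi-database collection
jq '{server: .server_info.server_type, discovered: .server_info.total_databases, collected: .server_info.collected_databases}' server_schema.json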
Collection Metadata
Every schema file includes metadata about the collection process:
{
"collection_metadata": {
"collected_at": "2024-01-15T10:30:00Z",
"collection_duration_ms": 1500,
"collector_version": "1.0.0",
"warnings": [
"Large table 'audit_logs' - collection took 45 seconds",
"Custom type 'geometry' not fully supported"
],
"collector_options": {
"include_system_tables": false,
"sample_data": true,
"max_sample_size": 1000
}
}
}
Validation Rules
Security Validation
The schema enforces strict security rules:
- No credential fields: Field names cannot contain password, secret, token, etc.
- No connection strings: Database URLs are automatically filtered
- No sensitive patterns: Common credential patterns are rejected
- Sanitized output: All error messages are credential-free
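DBSurveyor enforces these rules itself, but an external spot-check is easy to script. A minimal, illustrative sketch; the word list below is a simplification, not the tool's actual pattern set:
# Illustrative spot-check: print any key paths containing credential-like words
jq -r '[paths | map(tostring) | join(".")] | map(select(test("password|secret|token"; "i"))) | .[]' schema.json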
Data Validation
- String lengths: Maximum 255 characters for names, 1000 for comments
- Array limits: Maximum 1000 items in arrays
- Nesting depth: Maximum 10 levels of object nesting
- File size: Maximum 100MB per schema file
Type Validation
- Required fields: All mandatory fields must be present
- Data type consistency: Values must match declared types
- Constraint validation: Foreign keys must reference valid tables
- Index validation: Index columns must exist in referenced table
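The bundled validator performs these checks, but the foreign-key rule can also be verified externally. A hedged jq sketch that assumes all referenced tables live in the same file (cross-database references would need extra handling):
# Exit non-zero if any foreign key references a table missing from .tables
jq -e '[.tables[].name] as $names | [.tables[].foreign_keys[]?.referenced_table] | all(. as $t | $names | index($t) != null)' schema.json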
Usage Examples
Basic Schema Collection
# Collect PostgreSQL schema
dbsurveyor-collect postgres://user:pass@localhost/mydb --output schema.json
# Validate the output
dbsurveyor schema schema.json --validate
# Generate documentation
dbsurveyor schema schema.json --format markdown --output schema.md
Multi-Database Collection
# Collect all databases on server
dbsurveyor-collect postgres://admin:pass@localhost --all-databases --output server_schema.json
# Process specific database
dbsurveyor schema server_schema.json --database app_db --format json --output app_schema.json
Schema Validation
# Validate against JSON Schema
dbsurveyor validate schema.json
# Check for specific issues
dbsurveyor validate schema.json --check-security --check-constraints
Error Handling
Validation Errors
When validation fails, DBSurveyor provides detailed error information:
{
"validation_errors": [
{
"path": "/tables/0/columns/1/data_type",
"message": "Invalid data type: expected String, Integer, Boolean, Date, Json, Uuid, Array, Binary, Custom, or DateTime",
"value": "VARCHAR",
"suggestion": "Use {\"String\": {\"max_length\": 255}} instead"
}
]
}
Collection Warnings
Warnings are included in the metadata for non-critical issues:
{
"warnings": [
"Large table 'audit_logs' (1.2M rows) - collection took 45 seconds",
"Custom type 'geometry' not fully supported - using Custom type",
"Table 'temp_users' appears to be temporary - may not persist"
]
}
Version Compatibility
Current Version: 1.0
- Format: Stable and fully supported
- Validation: Complete JSON Schema validation
- Features: All documented features available
- Backward Compatibility: N/A (first version)
Future Versions
The schema is designed for evolution:
- Additive Changes: New optional fields can be added
- Version Detection: Format version enables version-specific handling
- Migration Support: Tools will support upgrading between versions
- Deprecation Path: Old fields will be marked before removal
Integration
Documentation Tools
The schema format integrates with all DBSurveyor tools:
- dbsurveyor-collect: Generates schema files
- dbsurveyor: Processes and validates schemas
- dbsurveyor-docs: Generates documentation from schemas
- dbsurveyor-validate: Standalone validation tool
External Tools
The format is compatible with:
- JSON Schema validators: jsonschema, ajv, etc.
- Data analysis tools: pandas, jq, etc.
- Documentation generators: Docusaurus, MkDocs, etc.
- CI/CD pipelines: GitHub Actions, GitLab CI, etc.
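For example, a quick review with jq might tabulate table names and estimated row counts (schema.json is a placeholder; row_count is optional, so a fallback is supplied):
# Tabulate table names and estimated row counts
jq -r '.tables[] | "\(.name)\t\(.row_count // "n/a")"' schema.json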
Best Practices
Schema Collection
- Use descriptive names: Avoid generic names like "db1", "test"
- Include comments: Add meaningful descriptions for tables and columns
- Sample strategically: Use sampling for large tables to avoid huge files
- Validate early: Check schemas immediately after collection
Schema Storage
- Version control: Track schema changes in Git
- Backup regularly: Keep historical schema versions
- Compress large files: Use .zst compression for schemas >1MB (see the example after this list)
- Secure access: Limit access to production schemas
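For the compression recommendation, the zstd command-line tool is one way to produce .zst files; the compression level here is an illustrative choice, not a project requirement:
# Compress a large schema file (zstd keeps the original by default)
zstd -19 schema.json -o schema.json.zst
# Decompress when the schema is needed again
zstd -d schema.json.zst -o schema.json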
Schema Processing
- Validate inputs: Always validate before processing
- Handle errors gracefully: Check collection status before proceeding
- Monitor performance: Track collection times for optimization
- Document changes: Keep records of schema evolution
Troubleshooting
Common Issues
Collection Fails with Permission Error
# Check database user privileges
dbsurveyor-collect postgres://user:pass@localhost/db --test-connection
# Verify user has SELECT on information_schema
GRANT SELECT ON ALL TABLES IN SCHEMA information_schema TO username;
Large Schema File Size
# Use compression
dbsurveyor-collect postgres://localhost/db --compress
# Limit data sampling
dbsurveyor-collect postgres://localhost/db --max-sample-size 100
Validation Errors
# Check schema format
dbsurveyor validate schema.json --verbose
# Fix common issues
dbsurveyor fix schema.json --output fixed_schema.json
Performance Optimization
- Connection pooling: Use connection pooling for large databases
- Parallel collection: Collect multiple databases simultaneously (see the sketch after this list)
- Selective sampling: Only sample essential tables
- Incremental updates: Collect only changed schemas
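For parallel collection, any shell can run independent collections concurrently. A minimal sketch using only the documented --output flag; the database names and connection URL are placeholders for your environment:
# Hypothetical database names; adjust the URL and credentials to your server
for db in app_db analytics_db reporting_db; do
  dbsurveyor-collect "postgres://user:pass@localhost/$db" --output "${db}.json" &
done
wait  # block until all background collections finish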
This specification provides a complete reference for the .dbsurveyor.json format, ensuring consistent, validated, and secure schema collection across all supported database types.