Databricks Connector Integration Guide
Overview
The Databricks Connector automatically synchronizes data between Databricks and cloud storage. It monitors changes and handles bidirectional sync.
Automatic Table Change Discovery
The connector monitors Databricks tables and detects:
- New rows added to tables
- Table modifications since the last sync
- Unlabeled data ready for processing
When changes are detected, it exports the data to CSV files for labeling or processing.
Automatic Write-Back for Labeled Data
After data is labeled, the connector:
- Detects when labeled CSV files are available
- Automatically imports them back into Databricks tables
- Uses MERGE to update existing rows and insert new ones
Volume File Discovery and Sync
For Databricks Unity Catalog volumes, the connector:
- Discovers new media files and other files in storage
- Automatically syncs them back to Databricks tables or volumes
- Preserves directory structure
- Only processes files modified since the last sync
Key Features
- Automatic Discovery: Monitors Databricks tables and storage for changes
- Incremental Processing: Only processes new or modified data
- Background Processing: Runs automatically without manual intervention
- Data Integrity: Uses MERGE operations to ensure accurate updates
Once configured, the connector runs continuously, detecting changes and syncing data automatically.
Connection Setup
Step 1: Create a Service Principal in Databricks
Service principals authenticate the connector to your Databricks workspace using OAuth machine-to-machine (M2M) authentication.
1.1 Create the Service Principal
- In Databricks, go to Account Settings → User management → Service Principals
- Click Add Service Principal
- Create an OAuth secret:
  - Click Generate Secret
  - Save the Client Secret immediately (it won’t be shown again)
1.2 Grant Required Permissions
Grant the service principal:
- SQL Data Access: Read/write access to tables
- Workspace Access: Access to the workspace
- Unity Catalog Permissions: Read/write access to catalogs, schemas, and volumes
For Table Operations:
- Access to the catalog and schema containing your tables
- SELECT, INSERT, UPDATE, DELETE on target tables
For Volume Operations:
- Access to Unity Catalog volumes
- Read/write permissions on the target volume
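If you prefer to script these grants rather than assign them in the UI, the sketch below uses the databricks-sql-connector package against a SQL warehouse. The catalog, schema, table, volume, and warehouse names are placeholders; in Unity Catalog, the MODIFY privilege covers INSERT, UPDATE, and DELETE on a table.

# Illustrative sketch: scripting the Unity Catalog grants with databricks-sql-connector.
# All object names below are placeholders.
from databricks import sql

SP_CLIENT_ID = "your-application-client-id"  # grants reference the service principal's client ID

grants = [
    f"GRANT USE CATALOG ON CATALOG my_catalog TO `{SP_CLIENT_ID}`",
    f"GRANT USE SCHEMA ON SCHEMA my_catalog.my_schema TO `{SP_CLIENT_ID}`",
    f"GRANT SELECT, MODIFY ON TABLE my_catalog.my_schema.my_table TO `{SP_CLIENT_ID}`",
    f"GRANT READ VOLUME, WRITE VOLUME ON VOLUME my_catalog.my_schema.my_volume TO `{SP_CLIENT_ID}`",
]

with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/your-warehouse-id",
    access_token="your-admin-token",  # run as a user allowed to manage grants on these objects
) as connection:
    with connection.cursor() as cursor:
        for statement in grants:
            cursor.execute(statement)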
Step 2: Register the Service Principal
Register the service principal via the API with your credentials:
POST /api/databricks/service-principals
Authorization: Bearer <your-token>
Content-Type: application/json
{
"workspace_url": "https://your-workspace.cloud.databricks.com",
"client_id": "your-application-client-id",
"client_secret": "your-oauth-client-secret"
}
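For scripted setups, a minimal Python sketch of this request using the requests library is shown below. CONNECTOR_URL, API_TOKEN, and DATABRICKS_CLIENT_SECRET are placeholder environment variables, not names defined by the connector.

# Sketch: register the Databricks service principal with the connector API.
import os
import requests

payload = {
    "workspace_url": "https://your-workspace.cloud.databricks.com",
    "client_id": "your-application-client-id",
    "client_secret": os.environ["DATABRICKS_CLIENT_SECRET"],  # keep the secret out of source control
}

response = requests.post(
    f"{os.environ['CONNECTOR_URL']}/api/databricks/service-principals",
    headers={"Authorization": f"Bearer {os.environ['API_TOKEN']}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())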
Step 3: Verify the Connection
List service principals to confirm registration:
GET /api/databricks/service-principals
Authorization: Bearer <your-token>
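A matching sketch for the verification call, using the same placeholder environment variables; it assumes the endpoint returns a JSON array of registered service principals.

# Sketch: list registered service principals to confirm the registration.
import os
import requests

response = requests.get(
    f"{os.environ['CONNECTOR_URL']}/api/databricks/service-principals",
    headers={"Authorization": f"Bearer {os.environ['API_TOKEN']}"},
    timeout=30,
)
response.raise_for_status()
for principal in response.json():  # assumes a JSON array response
    print(principal)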
Configuring Data Synchronization
Configure automatic synchronization between Databricks and cloud storage. The connector monitors changes and syncs data automatically.
Understanding Sync Types
| Type | What It Does | Use Case |
| --- | --- | --- |
| table | Exports Databricks tables → CSV files | Export data for labeling or processing |
| table_reverse | Imports labeled CSV files → Databricks tables | Write labeled data back to tables |
| volume_forward | Syncs files from Databricks volumes → storage | Export volume files |
| volume_reverse | Syncs files from storage → Databricks volumes | Import files into volumes |
Table Export Configuration (Databricks → Storage)
Exports Databricks tables to CSV files when changes are detected.
POST /api/databricks/sync-configs
Authorization: Bearer <your-token>
Content-Type: application/json
{
"name": "User Data Export",
"description": "Export user data for labeling",
"config_type": "table",
"databricks_path": "catalog.schema.table_name",
"service_principal_id": "550e8400-e29b-41d4-a716-446655440000",
"pk_column": "user_id",
"label_column": "label"
}
Required Fields:
- name: Configuration name
- config_type: "table"
- databricks_path: Full table path (catalog.schema.table)
- service_principal_id: Service principal ID from connection setup
Optional Fields:
- description: Description
- pk_column: Primary key column (default: "id")
- label_column: Label column (default: "label")
How It Works:
- Monitors the table for changes
- Creates CSV files automatically
- Tracks last sync time for incremental exports
Result: CSV files are created automatically when new or modified data is detected in your Databricks table.
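The same export configuration can be created programmatically; the sketch below posts the JSON shown above, with CONNECTOR_URL and API_TOKEN again used as placeholder environment variables.

# Sketch: create the table export configuration via the API.
import os
import requests

export_config = {
    "name": "User Data Export",
    "description": "Export user data for labeling",
    "config_type": "table",
    "databricks_path": "catalog.schema.table_name",
    "service_principal_id": "550e8400-e29b-41d4-a716-446655440000",
    "pk_column": "user_id",
    "label_column": "label",
}

response = requests.post(
    f"{os.environ['CONNECTOR_URL']}/api/databricks/sync-configs",
    headers={"Authorization": f"Bearer {os.environ['API_TOKEN']}"},
    json=export_config,
    timeout=30,
)
response.raise_for_status()
print("Created sync config:", response.json())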
Table Import Configuration (Storage → Databricks)
Imports labeled CSV files back into Databricks tables.
POST /api/databricks/sync-configs
Authorization: Bearer <your-token>
Content-Type: application/json
{
"name": "Labeled Data Import",
"description": "Import labeled data back to Databricks",
"config_type": "table_reverse",
"databricks_path": "catalog.schema.table_name",
"service_principal_id": "550e8400-e29b-41d4-a716-446655440000",
"pk_column": "id",
"label_column": "label",
"use_staging": true,
"staging_volume_path": "/Volumes/catalog/schema/staging_volume"
}
Required Fields:
- name: Configuration name
- config_type: "table_reverse"
- databricks_path: Full table path (catalog.schema.table)
- service_principal_id: Service principal ID
- pk_column: Primary key column for matching rows
- label_column: Column that will be updated
Optional Fields:
- description: Description
- use_staging: Use staging tables for large operations (recommended for >100K rows)
- staging_volume_path: Unity Catalog volume path for staging (required if use_staging is true)
How It Works:
- Detects when labeled CSV files are available
- Downloads and processes the CSV files
- Uses MERGE to update existing rows and insert new ones
- Updates based on the primary key column
Result: Labeled data is automatically merged into your Databricks table, updating existing rows and inserting new ones.
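To make the merge behavior concrete, the statement the connector issues is conceptually similar to the SQL built below. The staging table name and column list are illustrative placeholders, not the connector's actual internals.

# Conceptual sketch of the MERGE applied during import.
pk_column = "id"
label_column = "label"
target_table = "catalog.schema.table_name"
staging_table = "catalog.schema.table_name_staging"  # hypothetical staging table

merge_sql = f"""
MERGE INTO {target_table} AS target
USING {staging_table} AS source
  ON target.{pk_column} = source.{pk_column}
WHEN MATCHED THEN
  UPDATE SET target.{label_column} = source.{label_column}
WHEN NOT MATCHED THEN
  INSERT ({pk_column}, {label_column})
  VALUES (source.{pk_column}, source.{label_column})
"""
print(merge_sql)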
Volume File Sync Configuration
Syncs files between Databricks Unity Catalog volumes and cloud storage.
Import Files to Databricks Volume
POST /api/databricks/sync-configs
Authorization: Bearer <your-token>
Content-Type: application/json
{
"name": "Model Artifacts Import",
"description": "Sync ML model files to Databricks volume",
"config_type": "volume_reverse",
"databricks_path": "/Volumes/catalog/schema/model_volume",
"service_principal_id": "550e8400-e29b-41d4-a716-446655440000"
}
Required Fields:
- name: Configuration name
- config_type: "volume_reverse" (storage → Databricks) or "volume_forward" (Databricks → storage)
- databricks_path: Full Unity Catalog volume path starting with /Volumes/
- service_principal_id: Service principal ID
How It Works:
- Monitors storage for new or modified files
- Syncs files to the Databricks volume
- Preserves directory structure
- Only processes files modified since the last sync
Result: Files are automatically synchronized to your Databricks volume, maintaining directory structure and only processing new or modified files.
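The incremental selection can be pictured as follows: only files modified since the last sync are picked up, and their paths stay relative to the sync root so directory structure is preserved. This is an illustrative sketch, not the connector's implementation; the actual transfer is handled by the connector.

# Illustrative sketch of incremental file selection for a volume sync.
from datetime import datetime, timedelta, timezone
from pathlib import Path

def files_modified_since(sync_root: str, last_sync: datetime) -> list[tuple[Path, str]]:
    """Return (local_path, relative_path) pairs for files newer than last_sync."""
    root = Path(sync_root)
    changed = []
    for path in root.rglob("*"):
        if path.is_file():
            mtime = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
            if mtime > last_sync:
                # Relative path preserves the directory structure on the volume side.
                changed.append((path, str(path.relative_to(root))))
    return changed

# Example usage: files under ./model_artifacts changed in the last 24 hours.
# changed = files_modified_since("./model_artifacts", datetime.now(timezone.utc) - timedelta(hours=24))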
Complete Example: Data Labeling Workflow
Set up both export and import for a labeling workflow:
Step 1: Export Configuration
{
"name": "Export Unlabeled Data",
"config_type": "table",
"databricks_path": "production.data.users",
"service_principal_id": "550e8400-e29b-41d4-a716-446655440000",
"pk_column": "user_id",
"label_column": "classification"
}
Step 2: Import Configuration
{
"name": "Import Labeled Data",
"config_type": "table_reverse",
"databricks_path": "production.data.users",
"service_principal_id": "550e8400-e29b-41d4-a716-446655440000",
"pk_column": "user_id",
"label_column": "classification",
"use_staging": true,
"staging_volume_path": "/Volumes/production/staging/temp"
}
Automated Workflow:
- Export config detects new unlabeled rows → automatically exports to CSV files
- CSV files are processed through the labeling platform where data is labeled
- Labeled CSV files are placed back in cloud storage
- Import config detects labeled files → automatically merges them back into the table
Once both configurations are active, the system handles the complete cycle automatically: detecting new data, exporting it to CSV, and importing the labeled data back into your Databricks table after it’s been processed through the labeling platform.
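Both configurations can be created in one script; the sketch below posts them to the sync-configs endpoint, with CONNECTOR_URL and API_TOKEN as placeholder environment variables.

# Sketch: create both halves of the labeling workflow in one script.
import os
import requests

common = {
    "databricks_path": "production.data.users",
    "service_principal_id": "550e8400-e29b-41d4-a716-446655440000",
    "pk_column": "user_id",
    "label_column": "classification",
}

configs = [
    {"name": "Export Unlabeled Data", "config_type": "table", **common},
    {
        "name": "Import Labeled Data",
        "config_type": "table_reverse",
        "use_staging": True,
        "staging_volume_path": "/Volumes/production/staging/temp",
        **common,
    },
]

headers = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}
for config in configs:
    response = requests.post(
        f"{os.environ['CONNECTOR_URL']}/api/databricks/sync-configs",
        headers=headers,
        json=config,
        timeout=30,
    )
    response.raise_for_status()
    print("Created:", config["name"])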
Common Configuration Scenarios
Scenario 1: Simple Table Export
- Export unlabeled data for external processing
- Config: "table" type with pk_column and label_column
- Result: CSV files created automatically when new data arrives
Scenario 2: Labeling Pipeline
- Export → Label → Import workflow
- Configs: "table" for export, "table_reverse" for import
- Result: Automated bidirectional sync
Scenario 3: File Synchronization
- Sync media files or model artifacts
- Config: "volume_reverse" type
- Result: Files automatically synced to Databricks volumes
Scenario 4: Large Dataset Updates
- Handle large MERGE operations
- Config: "table_reverse" with use_staging: true
- Result: Efficient processing of large datasets without errors
Once configured, the connector monitors and syncs data automatically.