Databricks Connector Integration Guide

Overview

The Databricks Connector automatically synchronizes data between Databricks and cloud storage. It monitors changes and handles bidirectional sync.

The connector monitors Databricks tables and detects new or modified rows. When changes are detected, it exports the data to CSV files for labeling or processing.

After data is labeled, the connector merges the labeled data back into the source table, updating existing rows and inserting new ones.

For Databricks Unity Catalog volumes, the connector syncs files in either direction, preserving directory structure and processing only new or modified files.

Once configured, the connector runs continuously, detecting changes and syncing data automatically.

Connection Setup

The connector authenticates to your Databricks workspace with a service principal using OAuth machine-to-machine (M2M) credentials. To create one:

  1. In Databricks, go to Account Settings → User management → Service Principals
  2. Click Add Service Principal
  3. Create an OAuth secret:
    • Click Generate Secret
    • Save the Client Secret immediately (it won’t be shown again)
    • Note the service principal’s Client ID (application ID) as well; you will need it together with the secret

Grant the service principal access to the catalogs, schemas, tables, and volumes it will sync.
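The exact privileges depend on which sync types you configure; as a rough guide, table sync needs read (and, for imports, write) access to the target table, and volume sync needs read and write access to the volume. Below is a minimal sketch of the grants using the databricks-sql-connector package; the warehouse path, token, and object names are placeholders, and the principal is referenced by its application (client) ID:

# Sketch only: grant typical Unity Catalog privileges to the service principal.
# All object names, the warehouse HTTP path, and the admin token are placeholders.
from databricks import sql

statements = [
    "GRANT USE CATALOG ON CATALOG catalog TO `your-application-client-id`",
    "GRANT USE SCHEMA ON SCHEMA catalog.schema TO `your-application-client-id`",
    "GRANT SELECT, MODIFY ON TABLE catalog.schema.table_name TO `your-application-client-id`",
    "GRANT READ VOLUME, WRITE VOLUME ON VOLUME catalog.schema.staging_volume TO `your-application-client-id`",
]

with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/your-warehouse-id",
    access_token="your-admin-token",
) as conn:
    with conn.cursor() as cursor:
        for statement in statements:
            cursor.execute(statement)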

Register the service principal via the API with your credentials:

POST /api/databricks/service-principals
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "workspace_url": "https://your-workspace.cloud.databricks.com",
  "client_id": "your-application-client-id",
  "client_secret": "your-oauth-client-secret"
}

List service principals to confirm registration:

GET /api/databricks/service-principals
Authorization: Bearer <your-token>
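
These calls can also be scripted with any HTTP client. The following minimal Python sketch uses the requests library; the base URL is a placeholder for wherever the connector API is hosted, and the token is your API token from above:

import requests

BASE_URL = "https://your-connector-host"  # placeholder for the connector API host
HEADERS = {"Authorization": "Bearer <your-token>"}

# Register the service principal.
response = requests.post(
    f"{BASE_URL}/api/databricks/service-principals",
    headers=HEADERS,
    json={
        "workspace_url": "https://your-workspace.cloud.databricks.com",
        "client_id": "your-application-client-id",
        "client_secret": "your-oauth-client-secret",
    },
)
response.raise_for_status()

# Confirm registration by listing service principals.
listing = requests.get(f"{BASE_URL}/api/databricks/service-principals", headers=HEADERS)
listing.raise_for_status()
print(listing.json())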

Configuring Data Synchronization

Sync configurations define what the connector monitors and in which direction data flows between Databricks and cloud storage. Four configuration types are available:

| Type | What It Does | Use Case |
| --- | --- | --- |
| table | Exports Databricks tables → CSV files | Export data for labeling or processing |
| table_reverse | Imports labeled CSV files → Databricks tables | Write labeled data back to tables |
| volume_forward | Syncs files from Databricks volumes → storage | Export volume files |
| volume_reverse | Syncs files from storage → Databricks volumes | Import files into volumes |

Table Export (table)

Exports Databricks tables to CSV files when changes are detected.

POST /api/databricks/sync-configs
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "name": "User Data Export",
  "description": "Export user data for labeling",
  "config_type": "table",
  "databricks_path": "catalog.schema.table_name",
  "service_principal_id": "550e8400-e29b-41d4-a716-446655440000",
  "pk_column": "user_id",
  "label_column": "label"
}

Result: CSV files are created automatically when new or modified data is detected in your Databricks table.

Table Import (table_reverse)

Imports labeled CSV files back into Databricks tables.

POST /api/databricks/sync-configs
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "name": "Labeled Data Import",
  "description": "Import labeled data back to Databricks",
  "config_type": "table_reverse",
  "databricks_path": "catalog.schema.table_name",
  "service_principal_id": "550e8400-e29b-41d4-a716-446655440000",
  "pk_column": "id",
  "label_column": "label",
  "use_staging": true,
  "staging_volume_path": "/Volumes/catalog/schema/staging_volume"
}

Result: Labeled data is automatically merged into your Databricks table, updating existing rows and inserting new ones.
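
Conceptually the import is an upsert keyed on pk_column: when use_staging is enabled, the labeled CSV files are first written to the staging volume and then applied to the target table. The connector's actual statements may differ; the following is only a sketch of the equivalent merge semantics, with placeholder names (labeled_batch stands in for a view over the staged labeled rows):

# Sketch of the upsert semantics; not the connector's actual implementation.
# catalog.schema.table_name and labeled_batch are placeholders.
merge_sql = """
MERGE INTO catalog.schema.table_name AS target
USING labeled_batch AS source
ON target.id = source.id                                    -- pk_column
WHEN MATCHED THEN UPDATE SET target.label = source.label    -- label_column
WHEN NOT MATCHED THEN INSERT *
"""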

Volume Sync (volume_forward and volume_reverse)

Syncs files between Databricks Unity Catalog volumes and cloud storage. The example below imports files into a volume (volume_reverse); a volume_forward configuration is the same apart from the config_type and the direction of transfer.

POST /api/databricks/sync-configs
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "name": "Model Artifacts Import",
  "description": "Sync ML model files to Databricks volume",
  "config_type": "volume_reverse",
  "databricks_path": "/Volumes/catalog/schema/model_volume",
  "service_principal_id": "550e8400-e29b-41d4-a716-446655440000"
}

Result: Files are automatically synchronized to your Databricks volume, maintaining directory structure and only processing new or modified files.
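
For intuition, "only new or modified files" typically means that each pass compares every file's relative path, size, and modification time against a snapshot from the previous pass and transfers only the entries that differ. A rough sketch of that bookkeeping (illustrative only, not the connector's actual code):

import os
from typing import Dict, Tuple

# Snapshot maps relative path -> (size, mtime) recorded on the previous pass.
Snapshot = Dict[str, Tuple[int, float]]

def changed_files(root: str, previous: Snapshot) -> Snapshot:
    """Return only the files that are new or modified since the last snapshot."""
    changed: Snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(full_path, root)
            stat = os.stat(full_path)
            entry = (stat.st_size, stat.st_mtime)
            if previous.get(rel_path) != entry:
                changed[rel_path] = entry
    return changed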

Set up both export and import configurations for a complete labeling workflow.

Export configuration:

{
  "name": "Export Unlabeled Data",
  "config_type": "table",
  "databricks_path": "production.data.users",
  "service_principal_id": "550e8400-e29b-41d4-a716-446655440000",
  "pk_column": "user_id",
  "label_column": "classification"
}

Import configuration:

{
  "name": "Import Labeled Data",
  "config_type": "table_reverse",
  "databricks_path": "production.data.users",
  "service_principal_id": "550e8400-e29b-41d4-a716-446655440000",
  "pk_column": "user_id",
  "label_column": "classification",
  "use_staging": true,
  "staging_volume_path": "/Volumes/production/staging/temp"
}

How the workflow runs:

  1. Export config detects new unlabeled rows → automatically exports to CSV files
  2. CSV files are processed through the labeling platform where data is labeled
  3. Labeled CSV files are placed back in cloud storage
  4. Import config detects labeled files → automatically merges them back into the table

Once both configurations are active, the system handles the complete cycle automatically: detecting new data, exporting it to CSV, and importing the labeled data back into your Databricks table after it’s been processed through the labeling platform.
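
For example, both configurations for this round trip could be created in one short script; the base URL and token are the same placeholders used in the service-principal example above:

import requests

BASE_URL = "https://your-connector-host"  # placeholder for the connector API host
HEADERS = {"Authorization": "Bearer <your-token>"}

export_config = {
    "name": "Export Unlabeled Data",
    "config_type": "table",
    "databricks_path": "production.data.users",
    "service_principal_id": "550e8400-e29b-41d4-a716-446655440000",
    "pk_column": "user_id",
    "label_column": "classification",
}

import_config = {
    "name": "Import Labeled Data",
    "config_type": "table_reverse",
    "databricks_path": "production.data.users",
    "service_principal_id": "550e8400-e29b-41d4-a716-446655440000",
    "pk_column": "user_id",
    "label_column": "classification",
    "use_staging": True,
    "staging_volume_path": "/Volumes/production/staging/temp",
}

# Create both sync configurations against the same endpoint.
for config in (export_config, import_config):
    response = requests.post(f"{BASE_URL}/api/databricks/sync-configs", headers=HEADERS, json=config)
    response.raise_for_status()
    print(response.json())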

Once configured, the connector monitors and syncs data automatically.