Databricks Connector Integration Guide
Overview
The Databricks Connector automatically synchronizes data between Databricks and cloud storage. It monitors changes and handles bidirectional sync.
Automatic Table Change Discovery
The connector monitors Databricks tables and detects:
- New rows added to tables
- Table modifications since the last sync
- Unlabeled data ready for processing
When changes are detected, it exports the data to CSV files for labeling or processing.
Automatic Write-Back for Labeled Data
After data is labeled, the connector:
- Detects when labeled CSV files are available
- Automatically imports them back into Databricks tables
- Uses MERGE to update existing rows and insert new ones
Volume File Discovery and Sync
For Databricks Unity Catalog volumes, the connector:
- Discovers new media files and other files in storage
- Automatically syncs them back to Databricks tables or volumes
- Preserves directory structure
- Only processes files modified since the last sync
Key Features
- Automatic Discovery: Monitors Databricks tables and storage for changes
- Incremental Processing: Only processes new or modified data
- Background Processing: Runs automatically without manual intervention
- Data Integrity: Uses MERGE operations to ensure accurate updates
Once configured, the connector runs continuously, detecting changes and syncing data automatically.
Connection Setup
Step 1: Create a Service Principal in Databricks
Service principals authenticate the connector to your Databricks workspace using OAuth machine-to-machine (M2M) authentication.
1.1 Create the Service Principal
- In Databricks, go to Account Settings → User management → Service Principals
- Click Add Service Principal
- Create an OAuth secret:
  - Click Generate Secret
  - Save the Client Secret immediately (it won’t be shown again)
1.2 Grant Required Permissions
Grant the service principal:
- SQL Data Access: Read/write access to tables
- Workspace Access: Access to the workspace
- Unity Catalog Permissions: Read/write access to catalogs, schemas, and volumes
For Table Operations:
- Access to the catalog and schema containing your tables
- SELECT, INSERT, UPDATE, DELETE on target tables
For Volume Operations:
- Access to Unity Catalog volumes
- Read/write permissions on the target volume
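If you prefer to script these grants rather than assign them in the UI, the sketch below uses the databricks-sql-connector package against a SQL warehouse. The catalog, schema, table, volume, and warehouse names are placeholders; in Unity Catalog, the MODIFY privilege covers INSERT, UPDATE, and DELETE on a table.

# Illustrative sketch: scripting the Unity Catalog grants with databricks-sql-connector.
# All object names below are placeholders.
from databricks import sql

SP_CLIENT_ID = "your-application-client-id"  # grants reference the service principal's client ID

grants = [
    f"GRANT USE CATALOG ON CATALOG my_catalog TO `{SP_CLIENT_ID}`",
    f"GRANT USE SCHEMA ON SCHEMA my_catalog.my_schema TO `{SP_CLIENT_ID}`",
    f"GRANT SELECT, MODIFY ON TABLE my_catalog.my_schema.my_table TO `{SP_CLIENT_ID}`",
    f"GRANT READ VOLUME, WRITE VOLUME ON VOLUME my_catalog.my_schema.my_volume TO `{SP_CLIENT_ID}`",
]

with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/your-warehouse-id",
    access_token="your-admin-token",  # run as a user allowed to manage grants on these objects
) as connection:
    with connection.cursor() as cursor:
        for statement in grants:
            cursor.execute(statement)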
Step 2: Register the Service Principal
Register the service principal via the API with your credentials:
POST /api/databricks/service-principals
Authorization: Bearer <your-token>
Content-Type: application/json
{
"workspace_url": "https://your-workspace.cloud.databricks.com",
"client_id": "your-application-client-id",
"client_secret": "your-oauth-client-secret"
}
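For scripted setups, a minimal Python sketch of this request using the requests library is shown below. CONNECTOR_URL, API_TOKEN, and DATABRICKS_CLIENT_SECRET are placeholder environment variables, not names defined by the connector.

# Sketch: register the Databricks service principal with the connector API.
import os
import requests

payload = {
    "workspace_url": "https://your-workspace.cloud.databricks.com",
    "client_id": "your-application-client-id",
    "client_secret": os.environ["DATABRICKS_CLIENT_SECRET"],  # keep the secret out of source control
}

response = requests.post(
    f"{os.environ['CONNECTOR_URL']}/api/databricks/service-principals",
    headers={"Authorization": f"Bearer {os.environ['API_TOKEN']}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())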
Step 3: Verify the Connection
List service principals to confirm registration:
GET /api/databricks/service-principals
Authorization: Bearer <your-token>
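A matching sketch for the verification call, using the same placeholder environment variables; it assumes the endpoint returns a JSON array of registered service principals.

# Sketch: list registered service principals to confirm the registration.
import os
import requests

response = requests.get(
    f"{os.environ['CONNECTOR_URL']}/api/databricks/service-principals",
    headers={"Authorization": f"Bearer {os.environ['API_TOKEN']}"},
    timeout=30,
)
response.raise_for_status()
for principal in response.json():  # assumes a JSON array response
    print(principal)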
Configuring Data Synchronization
Configure automatic synchronization between Databricks and cloud storage. The connector monitors changes and syncs data automatically.
Understanding Sync Types
| Type | What It Does | Use Case |
| --- | --- | --- |
| table | Exports Databricks tables → CSV files | Export data for labeling or processing |
| table_reverse | Imports labeled CSV files → Databricks tables | Write labeled data back to tables |
| volume_forward | Syncs files from Databricks volumes → storage | Export volume files |
| volume_reverse | Syncs files from storage → Databricks volumes | Import files into volumes |
Table Export Configuration (Databricks → Storage)
Exports Databricks tables to CSV files when changes are detected.
POST /api/databricks/sync-configs
Authorization: Bearer <your-token>
Content-Type: application/json
{
"name": "User Data Export",
"description": "Export user data for labeling",
"config_type": "table",
"databricks_path": "catalog.schema.table_name",
"service_principal_id": "550e8400-e29b-41d4-a716-446655440000",
"pk_column": "user_id",
"label_column": "label"
}
Required Fields:
- name: Configuration name
- config_type: "table"
- databricks_path: Full table path (catalog.schema.table)
- service_principal_id: Service principal ID from connection setup
Optional Fields:
- description: Description
- pk_column: Primary key column (default: "id")
- label_column: Label column (default: "label")
How It Works:
- Monitors the table for changes
- Creates CSV files automatically
- Tracks last sync time for incremental exports
Result: CSV files are created automatically when new or modified data is detected in your Databricks table.
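The same export configuration can be created programmatically; the sketch below posts the JSON shown above, with CONNECTOR_URL and API_TOKEN again used as placeholder environment variables.

# Sketch: create the table export configuration via the API.
import os
import requests

export_config = {
    "name": "User Data Export",
    "description": "Export user data for labeling",
    "config_type": "table",
    "databricks_path": "catalog.schema.table_name",
    "service_principal_id": "550e8400-e29b-41d4-a716-446655440000",
    "pk_column": "user_id",
    "label_column": "label",
}

response = requests.post(
    f"{os.environ['CONNECTOR_URL']}/api/databricks/sync-configs",
    headers={"Authorization": f"Bearer {os.environ['API_TOKEN']}"},
    json=export_config,
    timeout=30,
)
response.raise_for_status()
print("Created sync config:", response.json())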
Table Import Configuration (Storage → Databricks)
Imports labeled CSV files back into Databricks tables.
POST /api/databricks/sync-configs
Authorization: Bearer <your-token>
Content-Type: application/json
{
"name": "Labeled Data Import",
"description": "Import labeled data back to Databricks",
"config_type": "table_reverse",
"databricks_path": "catalog.schema.table_name",
"service_principal_id": "550e8400-e29b-41d4-a716-446655440000",
"pk_column": "id",
"label_column": "label",
"use_staging": true,
"staging_volume_path": "/Volumes/catalog/schema/staging_volume"
}
Required Fields:
- name: Configuration name
- config_type: "table_reverse"
- databricks_path: Full table path (catalog.schema.table)
- service_principal_id: Service principal ID
- pk_column: Primary key column for matching rows
- label_column: Column that will be updated
Optional Fields:
- description: Description
- use_staging: Use staging tables for large operations (recommended for >100K rows)
- staging_volume_path: Unity Catalog volume path for staging (required if use_staging is true)
How It Works:
- Detects when labeled CSV files are available
- Downloads and processes the CSV files
- Uses MERGE to update existing rows and insert new ones
- Updates based on the primary key column
Result: Labeled data is automatically merged into your Databricks table, updating existing rows and inserting new ones.
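To make the merge behavior concrete, the statement the connector issues is conceptually similar to the SQL built below. The staging table name and column list are illustrative placeholders, not the connector's actual internals.

# Conceptual sketch of the MERGE applied during import.
pk_column = "id"
label_column = "label"
target_table = "catalog.schema.table_name"
staging_table = "catalog.schema.table_name_staging"  # hypothetical staging table

merge_sql = f"""
MERGE INTO {target_table} AS target
USING {staging_table} AS source
  ON target.{pk_column} = source.{pk_column}
WHEN MATCHED THEN
  UPDATE SET target.{label_column} = source.{label_column}
WHEN NOT MATCHED THEN
  INSERT ({pk_column}, {label_column})
  VALUES (source.{pk_column}, source.{label_column})
"""
print(merge_sql)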
Volume File Sync Configuration
Syncs files between Databricks Unity Catalog volumes and cloud storage.
Import Files to Databricks Volume
POST /api/databricks/sync-configs
Authorization: Bearer <your-token>
Content-Type: application/json
{
"name": "Model Artifacts Import",
"description": "Sync ML model files to Databricks volume",
"config_type": "volume_reverse",
"databricks_path": "/Volumes/catalog/schema/model_volume",
"service_principal_id": "550e8400-e29b-41d4-a716-446655440000"
}
Required Fields:
- name: Configuration name
- config_type: "volume_reverse" (storage → Databricks) or "volume_forward" (Databricks → storage)
- databricks_path: Full Unity Catalog volume path starting with /Volumes/
- service_principal_id: Service principal ID
How It Works:
- Monitors storage for new or modified files
- Syncs files to the Databricks volume
- Preserves directory structure
- Only processes files modified since the last sync
Result: Files are automatically synchronized to your Databricks volume, maintaining directory structure and only processing new or modified files.
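The incremental selection can be pictured as follows: only files modified since the last sync are picked up, and their paths stay relative to the sync root so directory structure is preserved. This is an illustrative sketch, not the connector's implementation; the actual transfer is handled by the connector.

# Illustrative sketch of incremental file selection for a volume sync.
from datetime import datetime, timedelta, timezone
from pathlib import Path

def files_modified_since(sync_root: str, last_sync: datetime) -> list[tuple[Path, str]]:
    """Return (local_path, relative_path) pairs for files newer than last_sync."""
    root = Path(sync_root)
    changed = []
    for path in root.rglob("*"):
        if path.is_file():
            mtime = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
            if mtime > last_sync:
                # Relative path preserves the directory structure on the volume side.
                changed.append((path, str(path.relative_to(root))))
    return changed

# Example usage: files under ./model_artifacts changed in the last 24 hours.
# changed = files_modified_since("./model_artifacts", datetime.now(timezone.utc) - timedelta(hours=24))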
Complete Example: Data Labeling Workflow
Set up both export and import for a labeling workflow:
Step 1: Export Configuration
{
"name": "Export Unlabeled Data",
"config_type": "table",
"databricks_path": "production.data.users",
"service_principal_id": "550e8400-e29b-41d4-a716-446655440000",
"pk_column": "user_id",
"label_column": "classification"
}
Step 2: Import Configuration
{
"name": "Import Labeled Data",
"config_type": "table_reverse",
"databricks_path": "production.data.users",
"service_principal_id": "550e8400-e29b-41d4-a716-446655440000",
"pk_column": "user_id",
"label_column": "classification",
"use_staging": true,
"staging_volume_path": "/Volumes/production/staging/temp"
}
Automated Workflow:
- Export config detects new unlabeled rows → automatically exports to CSV files
- CSV files are processed through the labeling platform where data is labeled
- Labeled CSV files are placed back in cloud storage
- Import config detects labeled files → automatically merges them back into the table
Once both configurations are active, the system handles the complete cycle automatically: detecting new data, exporting it to CSV, and importing the labeled data back into your Databricks table after it’s been processed through the labeling platform.
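Both configurations can be created in one script; the sketch below posts them to the sync-configs endpoint, with CONNECTOR_URL and API_TOKEN as placeholder environment variables.

# Sketch: create both halves of the labeling workflow in one script.
import os
import requests

common = {
    "databricks_path": "production.data.users",
    "service_principal_id": "550e8400-e29b-41d4-a716-446655440000",
    "pk_column": "user_id",
    "label_column": "classification",
}

configs = [
    {"name": "Export Unlabeled Data", "config_type": "table", **common},
    {
        "name": "Import Labeled Data",
        "config_type": "table_reverse",
        "use_staging": True,
        "staging_volume_path": "/Volumes/production/staging/temp",
        **common,
    },
]

headers = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}
for config in configs:
    response = requests.post(
        f"{os.environ['CONNECTOR_URL']}/api/databricks/sync-configs",
        headers=headers,
        json=config,
        timeout=30,
    )
    response.raise_for_status()
    print("Created:", config["name"])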
Common Configuration Scenarios
Scenario 1: Simple Table Export
- Export unlabeled data for external processing
- Config: "table" type with pk_column and label_column
- Result: CSV files created automatically when new data arrives
Scenario 2: Labeling Pipeline
- Export → Label → Import workflow
- Configs: "table" for export, "table_reverse" for import
- Result: Automated bidirectional sync
Scenario 3: File Synchronization
- Sync media files or model artifacts
- Config: "volume_reverse" type
- Result: Files automatically synced to Databricks volumes
Scenario 4: Large Dataset Updates
- Handle large MERGE operations
- Config: "table_reverse" with use_staging: true
- Result: Efficient processing of large datasets without errors
Once configured, the connector monitors and syncs data automatically.