Databricks

The following setup allows Alvin to access your Databricks metadata and query history, without being able to touch the underlying data*.

This setup only supports Databricks with Unity Catalog enabled; for other types of Databricks environments, please get in touch with our support. *For Alvin to extract and monitor data volumes of tables, we ask for an additional permission, specified in step 6; that step is optional and enables additional features in Alvin.

1. Generate a Databricks access token

1.1 - Create Service Principal

Under Workspace settings / Identity and access / Service Principals, create a new service principal; you may call it, for example, databricks_unity_catalog_extractor.

Get the displayed value for Application Id, for example: 1d62fbf3-2a96-44bd-942b-55f89cd38a77

Make sure the following Entitlements are enabled:

1.2 - Grant token usage to service principal in workspace

Follow the instructions here to give service principal permissions to use access tokens.

Make sure the service principal created at step 1.1 has Can Use permission under Token Usage.

1.3 - Generate an access token for service principal

Follow the instructions here to generate an access token for the service principal. If you want the connection to Databricks to be uninterrupted by the token expiring, set lifetime_seconds to null to prevent the token from expiring. Save this access token somewhere safe.

Example:

databricks token-management create-obo-token {service_principal_application_id}

You will need the generated <my-token_value> to complete the connection setup later on.

{
  "token_info": {
    "comment":"",
    "created_by_id":571098782507535,
    "created_by_username":"user@company.com",
    "creation_time":1721859192753,
    "expiry_time":-1,
    "owner_id":6607881346135267,
    "token_id":"<my-token_id>"
  },
  "token_value":"<my-token_value>"
}
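The response shown above can be handled as in the following sketch, which parses the JSON, pulls out the token value, and checks that the token never expires (`expiry_time` of `-1`). The field values used here are illustrative, not real credentials.

```python
import json

# Example response from `databricks token-management create-obo-token`,
# matching the shape shown above (values are illustrative placeholders).
raw_response = """
{
  "token_info": {
    "comment": "",
    "expiry_time": -1,
    "token_id": "example-token-id"
  },
  "token_value": "dapi-example-token"
}
"""

response = json.loads(raw_response)

# Save this value somewhere safe; you need it for the connection setup.
token = response["token_value"]

# expiry_time of -1 means lifetime_seconds was null: the token never expires.
never_expires = response["token_info"]["expiry_time"] == -1

print(token, never_expires)
```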

2. Grant permissions to service principal on each catalog you want Alvin to extract

Run the following with a user that is allowed to GRANT permissions, usually an ADMIN user, giving these permissions to the service principal you created at step 1.1:

-- this step must run for each catalog that has unity catalog tables
GRANT USE_CATALOG ON CATALOG prod TO `{service_principal_application_id}`;
GRANT USE_SCHEMA ON CATALOG prod TO `{service_principal_application_id}`;
GRANT BROWSE ON CATALOG prod TO `{service_principal_application_id}`;

-- this step must run once for the system catalog;
-- only for the system catalog must you also grant
-- SELECT permission, to extract the system metadata tables
-- (these tables contain only metadata).
GRANT USE_CATALOG ON CATALOG system TO `{service_principal_application_id}`;
GRANT USE_SCHEMA ON CATALOG system TO `{service_principal_application_id}`;
GRANT SELECT ON CATALOG system TO `{service_principal_application_id}`;
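If you have many catalogs, the statements above can be rendered per catalog with a small helper like the sketch below. This is illustrative only (not part of Alvin); it simply repeats the GRANTs from this step for each catalog name you pass in.

```python
# Privileges granted on every catalog Alvin should extract (from step 2).
CATALOG_PRIVILEGES = ["USE_CATALOG", "USE_SCHEMA", "BROWSE"]


def grant_statements(catalogs, application_id):
    """Render the GRANT statements from step 2 for the given catalogs."""
    stmts = []
    for catalog in catalogs:
        for priv in CATALOG_PRIVILEGES:
            stmts.append(
                f"GRANT {priv} ON CATALOG {catalog} TO `{application_id}`;"
            )
    # The system catalog additionally needs SELECT; its tables contain
    # only metadata.
    for priv in ["USE_CATALOG", "USE_SCHEMA", "SELECT"]:
        stmts.append(f"GRANT {priv} ON CATALOG system TO `{application_id}`;")
    return stmts


for stmt in grant_statements(["prod"], "1d62fbf3-2a96-44bd-942b-55f89cd38a77"):
    print(stmt)
```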

3. Make sure system tables are enabled

These steps need to be executed only once, if they have never been run before: https://docs.databricks.com/en/admin/system-tables/index.html#enable

Example command, listing the available system schemas:

curl -v -X GET -H "Authorization: Bearer {my_admin_token}" "https://{databricks_host}/api/2.0/unity-catalog/metastores/{metastore_id}/systemschemas"

Enabling the schemas used by Alvin:

# Query history:
curl -v -X PUT -H "Authorization: Bearer {my_admin_token}" "https://{databricks_host}/api/2.0/unity-catalog/metastores/{metastore_id}/systemschemas/query"

# Billing:
curl -v -X PUT -H "Authorization: Bearer {my_admin_token}" "https://{databricks_host}/api/2.0/unity-catalog/metastores/{metastore_id}/systemschemas/billing"

# Information Schema:
curl -v -X PUT -H "Authorization: Bearer {my_admin_token}" "https://{databricks_host}/api/2.0/unity-catalog/metastores/{metastore_id}/systemschemas/information_schema"
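The three PUT calls above differ only in the schema name at the end of the URL. As a sketch, the endpoints can be built like this, with `databricks_host` and `metastore_id` standing in for your own values:

```python
# System schemas Alvin relies on (from the curl commands above).
ALVIN_SYSTEM_SCHEMAS = ["query", "billing", "information_schema"]


def enable_urls(databricks_host, metastore_id):
    """Build the PUT endpoints that enable the system schemas Alvin uses."""
    base = (
        f"https://{databricks_host}/api/2.0/unity-catalog"
        f"/metastores/{metastore_id}/systemschemas"
    )
    return [f"{base}/{schema}" for schema in ALVIN_SYSTEM_SCHEMAS]


# Each URL receives a PUT request with the
# "Authorization: Bearer {my_admin_token}" header, as in the curl commands.
for url in enable_urls("example.cloud.databricks.com", "my-metastore-id"):
    print(url)
```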

4. Create a Databricks SQL Warehouse for Alvin

If you have a non-production warehouse, you may reuse it for Alvin, but the recommended approach is to create a new one.

  1. Follow the instructions here to create a SQL Warehouse for Alvin to use. You will use the Host, Port and HTTP path from the 'Connection details' tab when creating the connection to Databricks in Alvin.

  2. Click the 'Permissions' button and give the Alvin service principal 'Can use' permissions.
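Putting steps 1 and 4 together, these are the pieces of information you will have collected before creating the connection in Alvin. The field names and values below are illustrative placeholders, not Alvin's actual form fields:

```python
# Details gathered in steps 1-4 (illustrative placeholders only).
connection = {
    "host": "example.cloud.databricks.com",  # 'Connection details' tab
    "port": 443,                             # 'Connection details' tab
    "http_path": "/sql/1.0/warehouses/example",  # 'Connection details' tab
    "access_token": "dapi-example-token",    # token generated in step 1.3
}

for key, value in connection.items():
    print(key, value)
```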

5. Add connection to Alvin

Create a new connection here. Make sure the SQL Warehouse is up and running before hitting Test Connection, otherwise it might take a long time to validate the connection.

6. Additional row counts permissions (Optional)

Databricks Unity Catalog does not provide a metadata-only permission such as READ_METADATA, which was available on the hive_metastore.

For Alvin to extract the number of rows and bytes on tables, the following permission must be granted at the catalog, schema, or individual table level:

-- this step must run for each catalog that has unity catalog tables
GRANT SELECT ON CATALOG prod TO `{service_principal_application_id}`;

Alvin only runs SELECT count aggregations and DESCRIBE commands on tables, which can be audited in the Alvin user environment.
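The statements Alvin issues under this optional permission can be sketched as follows. The helper functions are hypothetical, for illustration only; they are not Alvin's actual code, but they show the shape of the count aggregations and DESCRIBE commands mentioned above:

```python
def row_count_query(catalog, schema, table):
    """A count aggregation of the kind Alvin runs to extract row counts."""
    return f"SELECT COUNT(*) FROM {catalog}.{schema}.{table}"


def describe_command(catalog, schema, table):
    """A DESCRIBE command of the kind Alvin runs against tables."""
    return f"DESCRIBE DETAIL {catalog}.{schema}.{table}"


print(row_count_query("prod", "sales", "orders"))
print(describe_command("prod", "sales", "orders"))
```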

7. Whitelist Alvin IP (Optional)

If your organization restricts Databricks access to a specific set of IP addresses, add the following IP to your Allowed IP Addresses list. Alvin will only access your Databricks through this address: 34.159.141.113
