Databricks

The following setup allows Alvin to access your Databricks metadata and query history, without being able to touch the underlying data*.


Last updated 9 months ago

This setup only supports Databricks with Unity Catalog enabled; for other types of Databricks environments, please get in touch with our support. * In order for Alvin to extract and monitor the data volumes of tables, we ask for an additional permission, specified in step 6. That step is optional and enables additional features in Alvin.

1. Generate a Databricks access token

1.1 - Create Service Principal

Under Workspace settings / Identity and access / Service Principals, create a new service principal; you may call it, for example, databricks_unity_catalog_extractor.

Get the displayed value for Application Id, for example: 1d62fbf3-2a96-44bd-942b-55f89cd38a77

Make sure the required Entitlements are enabled for the service principal.

1.2 - Grant token usage to service principal in workspace

Follow the Databricks instructions to give the service principal permission to use access tokens. Make sure the service principal created in step 1.1 has the Can Use permission under Token Usage.

1.3 - Generate an access token for service principal

Follow the Databricks instructions to generate an access token for the service principal. If you want the connection to Databricks to be uninterrupted by the token expiring, set lifetime_seconds to null to prevent the token from expiring. Save the token somewhere safe: you will need the generated <my-token_value> to complete the connection setup later on.

Example:

databricks token-management create-obo-token {service_principal_application_id}

{
  "token_info": {
    "comment":"",
    "created_by_id":571098782507535,
    "created_by_username":"user@company.com",
    "creation_time":1721859192753,
    "expiry_time":-1,
    "owner_id":6607881346135267,
    "token_id":"<my-token_id>"
  },
  "token_value":"<my-token_value>"
}
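The CLI returns the token as JSON. A minimal sketch of pulling out the token value and checking whether the token can expire, using the field names from the (abridged) sample response above:

```python
import json

# Abridged sample response from `databricks token-management create-obo-token`.
response = """
{
  "token_info": {
    "comment": "",
    "expiry_time": -1,
    "token_id": "<my-token_id>"
  },
  "token_value": "<my-token_value>"
}
"""

data = json.loads(response)
token = data["token_value"]  # shown only once; store it securely
never_expires = data["token_info"]["expiry_time"] == -1

print(token)
print(never_expires)
```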

2. Grant permissions to service principal on each catalog you want Alvin to extract

Run the following with a user that is allowed to GRANT permissions (usually an ADMIN user), giving these permissions to the service principal you created in step 1.1:

-- this step must run for each catalog that has unity catalog tables
GRANT USE_CATALOG ON CATALOG prod TO `{service_principal_application_id}`;
GRANT USE_SCHEMA ON CATALOG prod TO `{service_principal_application_id}`;
GRANT BROWSE ON CATALOG prod TO `{service_principal_application_id}`;

-- this step must run once for the system catalog; only for the
-- system catalog you must also grant SELECT, to extract the system
-- metadata tables (these tables contain only metadata)
GRANT USE_CATALOG ON CATALOG system TO `{service_principal_application_id}`;
GRANT USE_SCHEMA ON CATALOG system TO `{service_principal_application_id}`;
GRANT SELECT ON CATALOG system TO `{service_principal_application_id}`;
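If you have many catalogs, the per-catalog grants can be generated rather than typed by hand. A minimal sketch; the catalog names and the Application Id below are placeholders for your own:

```python
# Generate the per-catalog GRANT statements from step 2.
app_id = "1d62fbf3-2a96-44bd-942b-55f89cd38a77"  # service principal Application Id
catalogs = ["prod", "staging"]                   # your Unity Catalog catalogs

privileges = ["USE_CATALOG", "USE_SCHEMA", "BROWSE"]
statements = [
    f"GRANT {priv} ON CATALOG {cat} TO `{app_id}`;"
    for cat in catalogs
    for priv in privileges
]

for stmt in statements:
    print(stmt)
```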

3. Make sure system tables are enabled

These steps must be executed only once, if they have never been executed before. See https://docs.databricks.com/en/admin/system-tables/index.html#enable for details.

Example commands. To list the available system schemas:

curl -v -X GET -H "Authorization: Bearer {my_admin_token}" "https://{databricks_host}/api/2.0/unity-catalog/metastores/{metastore_id}/systemschemas"

To enable the schemas used by Alvin:

# Query history:
curl -v -X PUT -H "Authorization: Bearer {my_admin_token}" "https://{databricks_host}/api/2.0/unity-catalog/metastores/{metastore_id}/systemschemas/query"

# Billing:
curl -v -X PUT -H "Authorization: Bearer {my_admin_token}" "https://{databricks_host}/api/2.0/unity-catalog/metastores/{metastore_id}/systemschemas/billing"

# Information Schema:
curl -v -X PUT -H "Authorization: Bearer {my_admin_token}" "https://{databricks_host}/api/2.0/unity-catalog/metastores/{metastore_id}/systemschemas/information_schema"
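The same calls can be scripted. A sketch that builds the three enable requests with Python's standard library; the host, metastore ID, and token are placeholders, and nothing is sent until you call `urllib.request.urlopen`:

```python
import urllib.request

host = "dbc-example.cloud.databricks.com"  # placeholder workspace host
metastore_id = "<metastore_id>"            # placeholder
token = "<my_admin_token>"                 # an admin token, not the Alvin one

def enable_request(schema: str) -> urllib.request.Request:
    """Build the PUT request that enables one system schema."""
    url = (f"https://{host}/api/2.0/unity-catalog/metastores/"
           f"{metastore_id}/systemschemas/{schema}")
    return urllib.request.Request(
        url,
        method="PUT",
        headers={"Authorization": f"Bearer {token}"},
    )

reqs = [enable_request(s) for s in ("query", "billing", "information_schema")]
# To actually enable the schemas: urllib.request.urlopen(req) for each req.
```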

4. Create a Databricks SQL Warehouse for Alvin

Follow the Databricks instructions to create a SQL Warehouse for Alvin to use. If you have a non-production warehouse, you may reuse it for Alvin, but the recommended approach is to create a new one.

  1. Click the 'Permissions' button and give the Alvin service principal 'Can use' permissions.

You will use the Host, Port, and HTTP path from the 'Connection details' tab when creating the connection to Databricks in Alvin.

5. Add connection to Alvin

Create a new connection in Alvin's connection settings. Make sure the SQL Warehouse is up and running before hitting Test Connection, otherwise it might take a long time to validate the connection.

6. Additional row counts permissions (Optional)

In order for Alvin to extract the number of rows and bytes of tables, the following permission must be granted at the catalog, schema, or individual table level. Databricks Unity Catalog does not provide a less granular permission such as READ_METADATA, as the hive_metastore did:

-- this step must run for each catalog that has unity catalog tables
GRANT SELECT ON CATALOG prod TO `{service_principal_application_id}`;

Alvin only runs SELECT count aggregations and DESCRIBE commands on tables, which can be audited in the Alvin user environment.
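For illustration, these are the kinds of statement shapes that this permission is used for: COUNT aggregations and DESCRIBE, per table. The table names are placeholders, and the exact statements Alvin issues may differ:

```python
# Illustration of the statement shapes enabled by the step-6 permission:
# a COUNT aggregation and a DESCRIBE per table (table names are placeholders).
tables = ["prod.analytics.orders", "prod.analytics.customers"]

audit_statements = []
for table in tables:
    audit_statements.append(f"SELECT COUNT(*) FROM {table}")
    audit_statements.append(f"DESCRIBE TABLE {table}")

for stmt in audit_statements:
    print(stmt)
```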

7. Whitelist Alvin IP (Optional)

If your organization restricts Databricks access to a specific set of IP addresses, add the following IP to your Allowed IP Addresses list; Alvin will only access your Databricks from this address: 34.159.141.113
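If you manage your Databricks IP access list as code, here is a quick sanity check that the Alvin IP is covered; the allowed networks below are placeholders for your own:

```python
import ipaddress

ALVIN_IP = ipaddress.ip_address("34.159.141.113")

# Placeholder for your workspace's allowed CIDR ranges.
allowed = [
    ipaddress.ip_network("34.159.141.113/32"),
    ipaddress.ip_network("10.0.0.0/8"),
]

alvin_allowed = any(ALVIN_IP in net for net in allowed)
print(alvin_allowed)
```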

