Databricks
The following setup allows Alvin to access your Databricks metadata and query history, without being able to touch the underlying data*.
This setup only supports Databricks with Unity Catalog enabled. For other types of Databricks environment, please get in touch with our support.
* To extract and monitor the data volumes of tables, Alvin asks for an additional permission, specified in step 6. That step is optional and enables additional features in Alvin.
1. Generate a Databricks access token
1.1 - Create Service Principal
Under Workspace settings / Identity and access / Service Principals, create a new service principal. You may call it databricks_unity_catalog_extractor.
Get the displayed value for Application Id, for example:
1d62fbf3-2a96-44bd-942b-55f89cd38a77
Make sure the following Entitlements are enabled:
1.2 - Grant token usage to service principal in workspace
Follow the instructions here to give the service principal permission to use access tokens.
Make sure the service principal created at step 1.1 has Can Use permission under Token Usage.
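If you prefer to script this step, the sketch below builds the corresponding REST call without sending it. It assumes the workspace exposes the Databricks permissions endpoint for token usage at `/api/2.0/permissions/authorization/tokens`. The function name, the workspace URL and the admin token are placeholders, not values defined by this guide.

```python
import json
import urllib.request


def build_token_usage_request(workspace_url, admin_token, application_id):
    """Build (without sending) a request granting CAN_USE on tokens.

    Assumption: the endpoint path and payload shape follow the Databricks
    permissions API; verify them against your workspace before use.
    """
    body = {
        "access_control_list": [
            {
                # Service principals are identified by their Application Id.
                "service_principal_name": application_id,
                "permission_level": "CAN_USE",
            }
        ]
    }
    return urllib.request.Request(
        url=workspace_url.rstrip("/") + "/api/2.0/permissions/authorization/tokens",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + admin_token,
            "Content-Type": "application/json",
        },
        method="PATCH",  # PATCH adds to the existing permissions
    )
```

Sending the request is a single `urllib.request.urlopen(request)` call, made with a token that belongs to a workspace admin.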
1.3 - Generate an access token for service principal
Follow the instructions here to generate an access token for the service principal. To keep the connection to Databricks from being interrupted by token expiry, set lifetime_seconds to null so that the token never expires. Save this access token somewhere safe.
Example:
You will need the generated <my-token_value> to complete the connection setup later on.
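As a rough illustration of the token request, the sketch below builds a call to the Databricks token management endpoint that creates a token on behalf of a service principal. The endpoint path, the `comment` text and the function name are assumptions. Only `lifetime_seconds` and the Application Id come from the steps above.

```python
import json
import urllib.request


def build_token_request(workspace_url, admin_token, application_id,
                        lifetime_seconds=None):
    """Build (without sending) an on-behalf-of token creation request.

    Assumption: the workspace exposes
    /api/2.0/token-management/on-behalf-of/tokens.
    """
    body = {
        "application_id": application_id,
        # None is serialized as JSON null, i.e. a token that does not expire.
        "lifetime_seconds": lifetime_seconds,
        "comment": "Alvin metadata extraction",  # hypothetical label
    }
    return urllib.request.Request(
        url=workspace_url.rstrip("/") + "/api/2.0/token-management/on-behalf-of/tokens",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + admin_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Sending it (requires a real workspace and an admin token):
# with urllib.request.urlopen(build_token_request(url, token, app_id)) as resp:
#     print(json.load(resp))  # the response is expected to carry the token value
```

The request is built separately from the network call so that the payload can be inspected before anything is sent.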
2. Grant permissions to service principal on each catalog you want Alvin to extract
Run the following as a user who is allowed to GRANT permissions (usually an ADMIN user), giving the following permissions to the service principal you created in step 1.1:
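When several catalogs are involved, generating the statements can save some typing. The helper below is a minimal sketch that emits one Unity Catalog `GRANT` statement per catalog and privilege. The function name is hypothetical, and the privileges passed in the usage example (`USE CATALOG`, `USE SCHEMA`) are illustrative assumptions. Substitute the permissions listed for this step.

```python
def grant_statements(catalogs, application_id, privileges):
    """Return one GRANT statement per (catalog, privilege) pair.

    The service principal is referenced by its Application Id in backticks.
    """
    return [
        f"GRANT {privilege} ON CATALOG `{catalog}` TO `{application_id}`;"
        for catalog in catalogs
        for privilege in privileges
    ]


if __name__ == "__main__":
    # Placeholder catalog names and privileges, for illustration only.
    for statement in grant_statements(
        ["main", "analytics"],
        "1d62fbf3-2a96-44bd-942b-55f89cd38a77",
        ["USE CATALOG", "USE SCHEMA"],
    ):
        print(statement)
```

The printed statements can be pasted into the Databricks SQL editor and run by the admin user.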
3. Make sure system tables are enabled
These steps need to be executed only once, and only if they have never been executed before: https://docs.databricks.com/en/admin/system-tables/index.html#enable
Example commands. To list the available system schemas:
To enable the schemas used by Alvin:
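Both operations can be sketched against the Unity Catalog REST API as follows. This assumes the `systemschemas` endpoints under `/api/2.1/unity-catalog/metastores/<metastore-id>/`, so check the API version against the page linked above. The function names are hypothetical, and the schema names passed in are up to the caller. This sketch does not define which schemas Alvin needs.

```python
import urllib.request

# Assumption: API version 2.1, as in the Unity Catalog REST reference.
API_PREFIX = "/api/2.1/unity-catalog/metastores"


def _request(workspace_url, token, path, method):
    """Build an authenticated request without sending it."""
    return urllib.request.Request(
        url=workspace_url.rstrip("/") + path,
        headers={"Authorization": "Bearer " + token},
        method=method,
    )


def build_list_request(workspace_url, token, metastore_id):
    """Request listing the system schemas available in the metastore."""
    return _request(
        workspace_url, token, f"{API_PREFIX}/{metastore_id}/systemschemas", "GET"
    )


def build_enable_requests(workspace_url, token, metastore_id, schema_names):
    """One PUT request per system schema to enable."""
    return [
        _request(
            workspace_url,
            token,
            f"{API_PREFIX}/{metastore_id}/systemschemas/{name}",
            "PUT",
        )
        for name in schema_names
    ]
```

Each request can then be sent with `urllib.request.urlopen(request)` by an account that is allowed to manage the metastore.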
4. Create a Databricks SQL Warehouse for Alvin
If you have a non-production warehouse, you may reuse it for Alvin, but the recommended approach is to create a new one.
Follow the instructions here to create a SQL Warehouse for Alvin to use. You will use the Host, Port and HTTP path from the 'Connection details' tab when creating the connection to Databricks in Alvin.
Click the 'Permissions' button and give the Alvin service principal 'Can use' permissions.
5. Add connection to Alvin
Create a new connection here. Make sure the SQL Warehouse is up and running before hitting Test Connection, otherwise it might take a long time to validate the connection.
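One way to confirm that the warehouse is running before testing the connection is to ask the SQL Warehouses API for its state. The sketch below assumes the HTTP path has the usual `/sql/1.0/warehouses/<id>` form and that `GET /api/2.0/sql/warehouses/<id>` returns a JSON object with a `state` field. The function names are hypothetical.

```python
import re
import urllib.request


def warehouse_id_from_http_path(http_path):
    """Extract the warehouse id from an HTTP path like /sql/1.0/warehouses/<id>."""
    match = re.fullmatch(r"/sql/1\.0/warehouses/([^/]+)/?", http_path.strip())
    if match is None:
        raise ValueError(f"unexpected HTTP path: {http_path!r}")
    return match.group(1)


def build_state_request(host, token, http_path):
    """Build (without sending) the request that fetches the warehouse details.

    `host` is the bare hostname from the 'Connection details' tab.
    """
    warehouse_id = warehouse_id_from_http_path(http_path)
    return urllib.request.Request(
        url=f"https://{host}/api/2.0/sql/warehouses/{warehouse_id}",
        headers={"Authorization": "Bearer " + token},
        method="GET",
    )


def is_running(warehouse_payload):
    """True when the decoded API response reports the RUNNING state."""
    return warehouse_payload.get("state") == "RUNNING"
```

If the reported state is anything other than `RUNNING`, start the warehouse first and wait for it to come up before hitting Test Connection.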
6. Additional row counts permissions (Optional)
Databricks Unity Catalog does not provide a narrower, metadata-only permission such as READ_METADATA, which was available on the hive_metastore.
For Alvin to extract the number of rows and bytes of tables, the following permission must be granted at the catalog, schema or individual table level:
Alvin only runs SELECT count aggregations and DESCRIBE commands on tables, which can be audited in the Alvin user environment.
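To show how the three levels differ in practice, the sketch below formats a grant for a catalog, a schema or a single table. It assumes the privilege in question is `SELECT`, which is what the count aggregations described above would require. Treat that default, the function name and the object names as assumptions.

```python
SECURABLE_TYPES = ("CATALOG", "SCHEMA", "TABLE")


def row_count_grant(securable_type, name, application_id, privilege="SELECT"):
    """Format a GRANT statement at the chosen level.

    `name` is the (qualified) object name, e.g. `main`, `main.sales`
    or `main.sales.orders`. The default privilege is an assumption.
    """
    level = securable_type.upper()
    if level not in SECURABLE_TYPES:
        raise ValueError(f"securable_type must be one of {SECURABLE_TYPES}")
    return f"GRANT {privilege} ON {level} {name} TO `{application_id}`;"


if __name__ == "__main__":
    app_id = "1d62fbf3-2a96-44bd-942b-55f89cd38a77"
    # Narrowest scope first: a single table, then a schema, then a catalog.
    print(row_count_grant("table", "main.sales.orders", app_id))
    print(row_count_grant("schema", "main.sales", app_id))
    print(row_count_grant("catalog", "main", app_id))
```

Granting at the table or schema level keeps the permission limited to the objects you want row counts for, while a catalog-level grant covers everything beneath it.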
7. Whitelist Alvin IP (Optional)
If your organization restricts Databricks access to a specific set of IP addresses, add the following IP to your Allowed IP Addresses list. Alvin will only access your Databricks through this address: 34.159.141.113
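If the allow list is managed through the workspace IP access lists API rather than the UI, the entry could be created roughly as below. This is a sketch that assumes the `/api/2.0/ip-access-lists` endpoint and that the IP access list feature is enabled for the workspace. The label `alvin` and the function name are hypothetical.

```python
import json
import urllib.request

ALVIN_IP = "34.159.141.113"


def build_allow_list_request(workspace_url, admin_token, label="alvin"):
    """Build (without sending) a request that adds Alvin's IP to an ALLOW list."""
    body = {
        "label": label,  # hypothetical label, pick any name you like
        "list_type": "ALLOW",
        "ip_addresses": [ALVIN_IP],
    }
    return urllib.request.Request(
        url=workspace_url.rstrip("/") + "/api/2.0/ip-access-lists",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + admin_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

As with the other sketches, the request is only constructed here. Send it with `urllib.request.urlopen(request)` using a workspace admin token.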