Storage
Last updated
Last updated
We’ve all been there – drowning in an ever-increasing amount of tables, struggling to understand what this table is used for - who added it, who is using it and what for: the list goes on. It’s incredibly easy to add new pipelines to create more data, but ever harder for data teams to clean up and curate the data. Storage insights leverages and unites all available metadata about tables to allow fast decision-making without all the detective work.
That the problem of removing unused or duplicate data is an increasingly time-consuming and error-prone process. Unused and duplicate data is problematic for a range of reasons:
Compliance
Cost
No source of truth
Lack of trust
Overhead in communication
…and the list goes on.
Deleting a table is a lot more involved than just issuing the statement, it’s about understanding where the table “sits” in the overall pipeline (is it a source, staging or metric table), who “owns” it, who is using it, and what for, and if the table is connected to any dashboard, processes or systems outside of the warehouse. Once this is done, the process of engaging with potential stakeholders starts, and after that again, the actual work can be done. Phew..
It became clear to us that this process can be streamlined by bringing all of this information together. In true Alvin spirit, we collected, aggregated and summarized all the data in a neat list for our users.
Getting hands on:
What you will see here is screenshots of our own production environment and how we have used the product to cut our storage cost and remove unused tables. We have, as the chart indicates, actively used our product to reduce our storage and processing costs by over 50%.
The first thing I’m doing is to hover over the badge in the “Analytical users” column of an arbitrarily selected table and this shows me the analytical users of this table. This is important; analytical (i.e SELECT) usage is very different than if the table is used as part of a pipeline. This very quickly gives you an idea of the users who are actually querying the table.
Furthermore, we have analytical usage count and downstream usage count of the table. Showing downstream analytical usage is also important, as it's not uncommon to have tables that are important as pipeline sources or intermediary tables. Deleting such tables would break pipelines and thus tables and dashboards further down the pipeline.
The “Compute cost” column indicates the cost of queries building this table over the time period selected.
Powerful grouping and filtering capabilities
We’re not done yet, however. The tool offers powerful slicing and dicing capabilities. A commonly requested feature is to understand which schemas or even databases/projects have low utilization, and with a simple click you can aggregate all the stats across multiple dimensions, like schema, database or even owner/team/domain. You can also filter across these dimensions to drill down into a specific team, domain or database/project.
To make life even easier, the most common workflows have predefined filters in the sidebar, so you can just jump straight in!
Source tables
Intermediary pipeline tables
Pipeline output tables
“Island” Tables that are not part of any pipeline but used for analytical purposes
And lastly, unused tables.