Hey there, great to see Marmot here, and I'm a huge fan of your project. We recently deployed a catalog, but we went with OpenMetadata (https://open-metadata.org/), another amazing project.
What we missed in Marmot were existing integrations with Airflow and plugins for tools like Tableau and Power BI, as well as features such as SSO and MCP.
We're an enterprise and needed a more mature product. Fingers crossed Marmot gets there soon.
That's great to know, I wasn't aware anybody had even attempted to use it yet! I'm currently in the process of overhauling the plugin system. It's been quite hard to test some enterprise closed-source integrations like Tableau and Snowflake in order to build out plugins.
SSO is sort of available but undocumented. It currently only supports Okta, but I'm working on fleshing out a lot of this in the next big release (along with MCP).
That's useful feedback. Charlie, what's the process for adding integrations? A tutorial would be great. The plugin links here don't work: https://marmotdata.io/docs/Plugins/
Hey, there's some documentation around creating plugins here: https://marmotdata.io/docs/Develop/creating-plugins. It's relatively simple and involves adding a new Go package to the repo. Currently plugins have to be compiled into the binary, but I'd like to support external plugins at some point.
Also, thanks for pointing out the issue with the docs, I'll get that fixed!
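For readers wondering what "a Go package compiled into the binary" can look like in practice, here is a minimal sketch of the general compiled-in plugin pattern. It is not Marmot's actual plugin API: the `Plugin` interface, `Asset` type, `Register` function, and the `sqlite` example plugin are all hypothetical names made up for illustration. The linked docs are the place to look for the real interface.

```go
package main

import (
	"fmt"
	"sort"
)

// Asset is a hypothetical, minimal catalog entry (not Marmot's real type).
type Asset struct {
	Name string
	Type string
}

// Plugin is a hypothetical interface that a compiled-in plugin could implement.
type Plugin interface {
	Name() string
	Discover() ([]Asset, error)
}

// registry holds every plugin that was linked into the binary.
var registry = map[string]Plugin{}

// Register adds a plugin to the registry. A plugin package would
// typically call this from its own init() function.
func Register(p Plugin) {
	registry[p.Name()] = p
}

// PluginNames returns the registered plugin names in sorted order.
func PluginNames() []string {
	names := make([]string, 0, len(registry))
	for name := range registry {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// sqlitePlugin is an illustrative plugin that returns hard-coded assets
// instead of connecting to a real database.
type sqlitePlugin struct{}

func (sqlitePlugin) Name() string { return "sqlite" }

func (sqlitePlugin) Discover() ([]Asset, error) {
	return []Asset{
		{Name: "users", Type: "table"},
		{Name: "orders", Type: "table"},
	}, nil
}

func init() { Register(sqlitePlugin{}) }

func main() {
	for _, name := range PluginNames() {
		assets, err := registry[name].Discover()
		if err != nil {
			fmt.Println("discover failed:", err)
			continue
		}
		fmt.Printf("%s: %d assets\n", name, len(assets))
	}
}
```

The trade-off of this pattern is the one mentioned above: adding a plugin means rebuilding the binary, which keeps deployment to a single artifact but rules out loading third-party plugins at runtime.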
Hey HN, I wanted to show off my project Marmot! I decided to build Marmot after discovering a lot of data catalogs can be complex and require many external dependencies such as Kafka, Elasticsearch or an external orchestrator like Airflow.
Marmot is a single Go binary backed by Postgres. That's it!
It already supports:
Full-text search across tables, topics, queues, buckets, APIs
Glossary and asset to term associations
Flexible API so it can support almost any data asset!
Terraform/Pulumi/CLI for managing a catalog-as-code
10+ Plugins (and growing)
Live demo: https://demo.marmotdata.io
How does this get the maps of the data flows and so on? Does it require read credentials to each data silo, or is there a manual mapping process?
It supports either. I didn't want to restrict people to just one method of getting their catalog populated. The CLI and plugin system need read credentials for a given service and then populate the catalog with the assets they find there. Any lineage links currently need to be added manually (unless they're part of the same plugin).
Otherwise, you can integrate with your existing IaC pipelines using Terraform or Pulumi to populate the catalog at deploy time instead of needing to scrape a bunch of services.
When should you reach for a data catalog vs. a data warehouse or data lake? If you are choosing a data catalog this is probably obvious to you; if you just happened on this HN post, less so.
Also, what key decisions do other data catalogs make vs. your choices? What led to those decisions, and what is the benefit to users?
It depends on your ecosystem. If everything lives under one vendor, their native catalog will probably work really well for you. But most of the time (especially for older orgs) there's a huge fragmented ecosystem of data assets that aren't easily discoverable and are spread across multiple teams and vendors.
I like to think of Marmot as more of an "operational" catalog, with more of a focus on usability for individual contributors and not just data engineers. The key focus is simplicity, in terms of both deployment and usability.
This looks fantastic! I’ll need to explore building a SQLite / D1 plugin to consolidate all my worker data
How's it different from existing open source data catalogs like amundsen.io?
Amundsen has two databases and three services in its architecture diagram. For me, that's a smell: you now have a risk of inconsistency between the two databases, and you may have to learn how to tune Elasticsearch and Neo4j...
Versus the conceptually simpler "one binary, one container, one storage volume/database" model.
I acknowledge it's a false choice and a semi-silly thing to fixate on (how do you perf-tune ingestion queue problems vs write problems vs read problems for a Go binary?).
But, like, I have 10 different systems I'm already debugging.
Adding another one, like a data catalog that is supposed to make life easier, and then discovering I now have 5-subsystems-in-a-trenchcoat to possibly debug means I'm spending even more time babysitting the metadata manager rather than doing data engineering _for the business_.
https://www.amundsen.io/amundsen/architecture/