Protocol poking
Part VIII - The Hoovers
- Hypothesis: Siloed databases duplicate efforts and reduce coordination.
- Methods: Enumerate databases and examine dependency properties and procedures.
- Conclusion: Database providers effectively curate important data but having closed and non-standardised backends prevents coordination.
(Continued from poking part 7)
The Survey
Let us examine some important domains to see what databases are present:
Categories:
- ABIs
- Signatures
- Tags and Names
- Address Appearances
ABIs
What is the interface for this deployed contract?
Database | Keys | Values | Format | Data open source | Growth mechanism | Distribution | Size | Link |
---|---|---|---|---|---|---|---|---|
Sourcify | Contract Address | ABI | JSON | Open source | Web UI, API | Free API, scrape, IPNS, github clone | 16GB | sourcify.dev |
Etherscan | Contract Address | ABI | JSON | Closed Source | Free API | Free API | - | etherscan.io |
Signatures
What is the human readable version of this method or event?
Database | Keys | Values | Format | Data open source | Growth mechanism | Distribution | Size | Link |
---|---|---|---|---|---|---|---|---|
Ethereum signatures | Hex signature | Text signature | JSON | Closed source | Free API | Free API | - | samczsun.com |
4Byte site | Hex signature | Text signature | .txt files | Closed source (but published to GH) | Web UI, API | Free API | 962_000 entries | 4byte.directory |
4byte github | Hex signature | Text signature | .txt files | Open source | Periodic PR using data from 4byte.directory | Clone github | 962_000 entries | github |
Topic0 | Hex event signature | Text event signature | .txt files | Open source | Clone and parse Sourcify registry | GH Clone | 7_800 entries | github |
etk-4Byte | Hex signature | Text signature | Hard coded rust crate | Open source | Clone 4Byte directory | GH Clone | - | crates.io |
Tags and Names
What is the meaningful label for this address?
Database | Keys | Values | Format | Data open source | Growth mechanism | Distribution | Size | Link |
---|---|---|---|---|---|---|---|---|
Ethereum tags | Address | Tags | JSON | Closed source | Free API | Free API | - | samczsun.com |
Trueblocks Names | Address | Name, Tags | custom binary file | Open source | Trueblocks-driven | Install trueblocks-core | 9.2MB | trueblocks.io |
Kleros names | Address | Tags | - | Open source | Token Curated Registry | - | - | kleros.io |
RolodETH | Address | Name, tags | JSON files | Open source | GH PR | GH Clone | 640_437 entries | github |
Scraped Etherscan labels | Address | Name, tags | JSON file | Open Source | GH PR | Clone | 3.39MB | github |
Address Appearances
Where did my address appear?
Database | Keys | Values | Format | Data open source | Growth mechanism | Distribution | Size | Link |
---|---|---|---|---|---|---|---|---|
Unchained Index | Address | Appearances (tx ids) | custom binary files | Open source | Create locally and share | IPFS via smart contract publisher | 80GB | trueblocks.io |
Etherscan | Address | Appearances (txs) | JSON | Closed source | Free API | Free API | - | etherscan.io |
(Gedankenexperiment) address-appearance-index* | Address | Appearances (tx ids) | SSZ files | Open source | parse Unchained Index locally and share | IPFS via smart contract publisher | 80GB | github spec |
Note: address-appearance-index is a prototype/concept only. As seen in the table, it is a derivative of the Unchained Index and thus duplicates effort to get the same data in a new format. I designed the index as the starting point in an exploration of distributable databases. It may be wiser to lean more heavily into the Unchained Index than to try for a slightly different variation/derivative of it.
The Embedding
When a database is duplicated, the existing database infrastructure is left to waste.
For example, in the issue below the public 4byte registry is absorbed into a closed source database with a free API. The API is then embedded in an open source tool because it has better uptime characteristics.
https://github.com/foundry-rs/foundry/issues/1672
Now there are two competing entities trying to solve the same problem, each in a way that detracts from the other rather than strengthening it. This creates short-term solutions that ultimately fail our decentralised needs: “Trust me, I’ll keep maintaining this open API for free forever.” It may be true, but it makes the open source tool vulnerable.
What if the API is shut down? Well, you could try to return to the original sources of the hoovered data. But they may no longer exist, because their users were drawn to the “more reliable” closed API and the original effort was shut down.
Would it not be better if:
- The closed API that gains new data publishes it in a way that contributes to the original data?
- A Foundry user had the option of pinning parts of the database?
Agreeing on distributable database formats does not mean “everything has to be slow”. It means that:
- Users can contribute to hosting the database.
- Separate providers can contribute to the database.
The Formula
Databases can be redesigned so that all the above duplicated efforts could point to the same underlying content.
These entities:
- Samczsun’s Ethereum signatures
- Snake charmers’ 4Byte
- Williams’ Topic0
- Quilt’s etk-4Byte
Could all point to content identifiers (CIDs) for their databases. When they update their local database, they could use manifests to coordinate without communicating:
- Go to a manifest publishing contract
- Use the term “fourbyte” to get a list of IPNS names
- Check each IPNS to get a manifest
- Look at what each manifest contains. If another manifest has more data, use the CIDs to get that data.
- Now if you have additional data, create new additions with their own CIDs and produce a new manifest.
- Publish your manifest under your IPNS name.
Thus, separate parties could build the same database in a distributed way. They can provide the data via a free and performant API, but a user can also just use the manifests to get the latest data. Users can also pin.
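The comparison step in the list above can be sketched as follows. This is a minimal in-memory model under stated assumptions, not an implementation: the `Manifest` shape, the `merge_manifests` helper, and the placeholder IPNS names and CIDs are all hypothetical. A real version would read IPNS names from the publishing contract and fetch data over IPFS.

```python
from dataclasses import dataclass, field


@dataclass
class Manifest:
    """Hypothetical manifest: maps a volume identifier to the CID of its data."""
    publisher: str  # IPNS name of the publisher (placeholder string here)
    volumes: dict[int, str] = field(default_factory=dict)


def merge_manifests(mine: Manifest, peers: list[Manifest]) -> tuple[Manifest, dict[int, str]]:
    """Return the manifest to publish next and the volumes to fetch from peers."""
    to_fetch: dict[int, str] = {}
    for peer in peers:
        for volume_id, cid in peer.volumes.items():
            # Only volumes we do not already hold need to be downloaded.
            if volume_id not in mine.volumes and volume_id not in to_fetch:
                to_fetch[volume_id] = cid
    merged = Manifest(mine.publisher, {**mine.volumes, **to_fetch})
    return merged, to_fetch


# Placeholder data standing in for what the contract and IPNS lookups would return.
mine = Manifest("ipns-name-a", {961_000: "cid-961"})
peer = Manifest("ipns-name-b", {961_000: "cid-961", 962_000: "cid-962"})

merged, to_fetch = merge_manifests(mine, [peer])
print(sorted(to_fetch))        # volumes to download via their CIDs
print(sorted(merged.volumes))  # contents of the manifest to publish next
```

No party needs to message another here: each one reads the others' manifests and publishes its own.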
The Formats
Each database type could have an agreed upon format for the values in key-value pairs.
This is a public declaration / specification where one states:
- A JSON format
- A plain string
- An SSZ container
Interested parties could all agree that the format for a key-value pair is the following JSON:
{
"key": "hex_xyz",
"value": "readable_text_xyz"
}
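To show how little tooling such an agreement needs, here is a sketch that round-trips one entry in the JSON shape above. The `Entry` class is hypothetical, and the rule that keys are 0x-prefixed hex is an assumption a specification would have to state. The sample data is the well-known ERC-20 selector `0xa9059cbb` for `transfer(address,uint256)`.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical in-memory form of one agreed key-value pair."""
    key: str
    value: str

    def __post_init__(self) -> None:
        # Assumed rule: keys are 0x-prefixed hex strings.
        if not self.key.startswith("0x"):
            raise ValueError(f"key is not 0x-prefixed: {self.key!r}")
        int(self.key, 16)  # raises ValueError if the key is not valid hex

    def to_json(self) -> str:
        return json.dumps({"key": self.key, "value": self.value})

    @classmethod
    def from_json(cls, text: str) -> "Entry":
        raw = json.loads(text)
        return cls(key=raw["key"], value=raw["value"])


entry = Entry("0xa9059cbb", "transfer(address,uint256)")
restored = Entry.from_json(entry.to_json())
print(restored == entry)  # True
```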
The Sealing
A database that seals old data and never touches it can share pieces with peers.
This is a public declaration / specification where one states the rules to decide how to group new key-value pairs:
- When a key-value pair does not fit in the current Volume and must go in a new Volume
- Which Chapter the key-value pair fits in
For example, someone could publish a specification for fourbyte data:
- Fourbyte Volumes MUST contain EXACTLY 1_000 entries
- Fourbyte Chapters contain entries whose keys all start with the same hex character.
So if you had 1_200 entries to add to the database, you would know that you can only seal 1_000 of them into a Volume. Additionally, the volumes you publish will have 16 different chapters (0x0… to 0xf…).
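The arithmetic above can be checked with a short sketch. The `seal` function and the synthetic keys are hypothetical. The only rules taken from the example specification are the Volume size of 1_000 and the grouping into Chapters by first hex character.

```python
from collections import defaultdict

VOLUME_SIZE = 1_000  # example rule: a Volume holds exactly 1_000 entries


def seal(entries: list[tuple[str, str]]):
    """Split entries into full Volumes grouped by Chapter, and return the leftovers."""
    sealed_count = len(entries) // VOLUME_SIZE * VOLUME_SIZE
    volumes = []
    for start in range(0, sealed_count, VOLUME_SIZE):
        chapters = defaultdict(list)
        for key, value in entries[start:start + VOLUME_SIZE]:
            # Chapter name is "0x" plus the first hex character of the key.
            chapters[key[:3]].append((key, value))
        volumes.append(dict(chapters))
    return volumes, entries[sealed_count:]


# 1_200 synthetic entries with made-up keys and values.
entries = [(f"0x{i * 2654435761 % 2**32:08x}", f"signature_{i}") for i in range(1_200)]
volumes, pending = seal(entries)
print(len(volumes), len(pending))  # 1 200
```

The 200 leftover entries stay unsealed until another 800 arrive, so sealed Volumes never change and can be shared safely.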
So if 4byte.directory currently has 961_200 entries, they know that at 962_000 they will publish a manifest containing all existing CIDs plus:
- volume 961_000
- volume 962_000 (4byte.directory's latest addition)
- chapter 0x0 CID xyz
- chapter 0x1 CID xyz
- ...
- chapter 0xf CID xyz
That way, other database maintainers can see your published update and build the next volume on top of your latest volume.
If Samczsun incorporates volume 962_000 and has an additional 1_000 entries, they could publish a manifest containing volumes 962_000 and 963_000:
- volume 962_000 (first published at 4byte.directory's IPNS)
- volume 963_000 (published at Samczsun's IPNS)
- chapter 0x0 CID xyz
- chapter 0x1 CID xyz
- ...
- chapter 0xf CID xyz
The Transformer
To make this process smooth, Transformer software could be written to produce Volumes/Chapters and Manifests of the right format.
For example, a “TODD” format, where TODD is a Time-Ordered Distributable Database:
https://github.com/perama-v/TODD
This could exist as a local tool that lets a user participate in creating a distributable database of any arbitrary sort:
Any DB in -> TODD out
TODD + new data -> Updated TODD out
This could re-use common functions (manifest creation) and only require a custom adapter for each database type (ABIs, signatures, tags).
The Requirements
To be Transform-able, one needs to outline a pre- and post-TODD format for the new data.
E.g., imagine you have neat TODD pieces (volumes and chapters). Then you have a .csv list of new data you want to add. Perhaps it will become the next 3 volumes in the database. The transformer will accept data in a specific input format, which could be different for each database (binary files, .tsv, .csv).
All that is required is that the input data have a parser that can loop over all the keys in the data to be incorporated.
.csv -> write transformer parser -> parse -> todd
.tsv -> write transformer parser -> parse -> todd
.bin -> write transformer parser -> parse -> todd
The idea is that you want people to be able to create data in whatever format is convenient for them. Perhaps this format is different between creators - it does not matter because the transformer outputs TODD format which both parties agree on.
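A minimal sketch of this adapter idea follows, assuming each input row holds a key in the first column and a value in the second. The `PARSERS` registry and `transform` function are hypothetical names, and the output here is just the agreed key-value pairs rather than real TODD volumes.

```python
import csv
import io
from typing import Callable, Iterator

KeyValue = tuple[str, str]
Parser = Callable[[str], Iterator[KeyValue]]


def parse_delimited(delimiter: str) -> Parser:
    """Build a parser that loops over all keys in delimiter-separated text."""
    def parse(text: str) -> Iterator[KeyValue]:
        for row in csv.reader(io.StringIO(text), delimiter=delimiter):
            if len(row) >= 2:  # skip blank or malformed rows
                yield row[0], row[1]
    return parse


# Hypothetical registry: one small adapter per input format.
PARSERS: dict[str, Parser] = {
    ".csv": parse_delimited(","),
    ".tsv": parse_delimited("\t"),
}


def transform(extension: str, text: str) -> list[dict[str, str]]:
    """Convert input in any registered format to the shared key-value form."""
    return [{"key": k, "value": v} for k, v in PARSERS[extension](text)]


csv_text = '0xa9059cbb,"transfer(address,uint256)"\n'
tsv_text = "0xa9059cbb\ttransfer(address,uint256)\n"
print(transform(".csv", csv_text) == transform(".tsv", tsv_text))  # True
```

Two creators using different input formats end up with identical output, which is the only thing they need to agree on.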
The Contract
In a specification, say for fourbyte signatures, one would declare an arbitrary topic, thus making it clear under which term (“fourbyte”, “4byte”, “event-signatures”, etc.) publishing IPNS names should be registered.
See here for more rationale on IPNS name based broadcasting: https://github.com/perama-v/GAMB/issues/1
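The role of the contract can be illustrated with a toy in-memory stand-in. This is not contract code and not the GAMB interface: the `ManifestRegistry` class, its method names, and the IPNS names are all hypothetical. It only shows the lookup by topic.

```python
from collections import defaultdict


class ManifestRegistry:
    """Toy stand-in for a manifest publishing contract: topic -> IPNS names."""

    def __init__(self) -> None:
        self._names: dict[str, list[str]] = defaultdict(list)

    def register(self, topic: str, ipns_name: str) -> None:
        # Each publisher registers its IPNS name once under the agreed topic.
        if ipns_name not in self._names[topic]:
            self._names[topic].append(ipns_name)

    def publishers(self, topic: str) -> list[str]:
        return list(self._names.get(topic, []))


registry = ManifestRegistry()
registry.register("fourbyte", "ipns-name-a")
registry.register("fourbyte", "ipns-name-b")
registry.register("fourbyte", "ipns-name-a")  # duplicate, ignored

print(registry.publishers("fourbyte"))  # ['ipns-name-a', 'ipns-name-b']
print(registry.publishers("4byte"))     # []
```

The empty result for “4byte” shows why the specification must fix a single term: publishers registered under a different spelling would not find each other.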
The Hoovers
Many benevolent actors want to create databases as public goods. This can result in duplication of effort and vampire attacks on users.
Databases that grow do benefit from easy-to-use APIs/UIs that let users contribute small pieces of data (e.g., publish an ABI or event signature by API). So this is not a claim that “Free API providers are the problem”. Updating databases is a critical role.
The problem as I see it is the Molochian failure to coordinate.
The solution is Schelling points for database formats and database distribution mechanisms.
Data can be transparent, content-addressable pieces that can be periodically and permanently shared back to the user, who then acts as a provider.