Part VIII - The Hoovers

Hypothesis: Siloed databases duplicate efforts and reduce coordination.
Methods: Enumerate databases and examine dependency properties and procedures.
Conclusion: Database providers curate important data effectively, but closed and non-standardised backends prevent coordination.

The Survey

Let us examine some important domains to see what databases are present:

Categories:

ABIs
Signatures
Tags and Names
Address Appearances

ABIs

What is the interface for this deployed contract?

Database	Keys	Values	Format	Data open source	Growth mechanism	Distribution	Size	Link
Sourcify	Contract Address	ABI	JSON	Open source	Web UI, API	Free API, scrape, IPNS, github clone	16GB	sourcify.dev
Etherscan	Contract Address	ABI	JSON	Closed source	Free API	Free API	-	etherscan.io

Signatures

What is the human readable version of this method or event?

Database	Keys	Values	Format	Data open source	Growth mechanism	Distribution	Size	Link
Ethereum signatures	Hex signature	Text signature	JSON	Closed source	Free API	Free API	-	samczsun.com
4Byte site	Hex signature	Text signature	.txt files	Closed source (but published to GH)	Web UI, API	Free API	962_000 entries	4byte.directory
4byte github	Hex signature	Text signature	.txt files	Open source	Periodic PR using data from 4byte.directory	Clone github	962_000 entries	github
Topic0	Hex event signature	Text event signature	.txt files	Open source	Clone and parse Sourcify registry	GH Clone	7_800 entries	github
etk-4Byte	Hex signature	Text signature	Hard coded rust crate	Open source	Clone 4Byte directory	GH Clone	-	crates.io

Tags and Names

What is the meaningful label for this address?

Database	Keys	Values	Format	Data open source	Growth mechanism	Distribution	Size	Link
Ethereum tags	Address	Tags	JSON	Closed source	Free API	Free API	-	samczsun.com
Trueblocks Names	Address	Name, Tags	custom binary file	Open source	Trueblocks-driven	Install trueblocks-core	9.2MB	trueblocks.io
Kleros names	Address	Tags	-	Open source	Token Curated Registry	-	-	kleros.io
RolodETH	Address	Name, tags	JSON files	Open source	GH PR	GH Clone	640_437 entries	github
Scraped Etherscan labels	Address	Name, tags	JSON file	Open source	GH PR	Clone	3.39MB	github

Address Appearances

Where did my address appear?

Database	Keys	Values	Format	Data open source	Growth mechanism	Distribution	Size	Link
Unchained Index	Address	Appearances (tx ids)	custom binary files	Open source	Create locally and share	IPFS via smart contract publisher	80GB	trueblocks.io
Etherscan	Address	Appearances (txs)	JSON	Closed source	Free API	Free API	-	etherscan.io
(Gedankenexperiment) address-appearance-index*	Address	Appearances (tx ids)	SSZ files	Open source	parse Unchained Index locally and share	IPFS via smart contract publisher	80GB	github spec

Note: address-appearance-index is a prototype/concept only. As seen in the table, it is a derivative of the Unchained Index and thus duplicates effort to get the same data in a new format. I designed the index as the starting point in an exploration of distributable databases. It may be wiser to lean more heavily into the Unchained Index than to try for a slightly different variation/derivative of it.

The Embedding

When a database is duplicated, the existing database infrastructure goes to waste.

For example, here is a case where the public 4byte registry is absorbed into a closed source database with a free API. The API is then embedded in an open source tool because it has better uptime characteristics.

https://github.com/foundry-rs/foundry/issues/1672

Now there are two competing entities trying to solve the same problem in a way that detracts from the other rather than strengthening it. This creates short-term solutions that ultimately fail our decentralised needs. The implicit promise is: “Trust me, I’ll keep maintaining this open API for free forever.” It may be true, but it makes the open source tool vulnerable.

What if the API is shut down? Well, you could try to return to the original sources of the hoovered data. But they may no longer exist, because their users were drawn to the “more reliable” closed API and the original effort was shut down.

Would it not be better if:

The closed API that gains new data publishes it in a way that contributes to the original data?
A Foundry user had the option of pinning parts of the database?

Agreeing on distributable database formats does not mean “everything has to be slow”. It means that:

Users can contribute to hosting the database.
Separate providers can contribute to the database.

The Formula

Databases can be redesigned so that all the above duplicated efforts could point to the same underlying content.

These entities:

Samczsun’s Ethereum signatures
Snake charmers’ 4Byte
Williams’ Topic0
Quilt’s etk-4Byte

Could all point to content identifiers (CIDs) for their databases. When they update their local database, they could use manifests to coordinate without communicating:

Go to a manifest publishing contract
Use the term “fourbyte” to get a list of IPNS names
Check each IPNS to get a manifest
Look at what each manifest contains. If another manifest has more data, use the CIDs to get that data.
Now if you have additional data, create new additions with their own CIDs and produce a new manifest.
Publish your manifest under your IPNS name.
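
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Manifest` shape, the `REGISTRY` and `IPNS` dictionaries (in-memory stand-ins for the publishing contract and for IPNS resolution) and all CIDs and names are hypothetical, and no real IPFS or Ethereum calls are made.

```python
from dataclasses import dataclass, field


@dataclass
class Manifest:
    """Hypothetical manifest: maps a volume id to its chapter CIDs."""
    publisher: str
    volumes: dict[int, dict[str, str]] = field(default_factory=dict)


# In-memory stand-ins (assumptions) for the manifest publishing contract
# and for IPNS name resolution.
REGISTRY: dict[str, list[str]] = {"fourbyte": ["ipns-4byte", "ipns-samczsun"]}
IPNS: dict[str, Manifest] = {
    "ipns-4byte": Manifest(
        "4byte", {961_000: {"0x0": "cid-a"}, 962_000: {"0x0": "cid-b"}}
    ),
    "ipns-samczsun": Manifest("samczsun", {961_000: {"0x0": "cid-a"}}),
}


def sync(topic: str, my_ipns: str, new_volumes: dict[int, dict[str, str]]) -> Manifest:
    """Merge every peer manifest for a topic, add local volumes, republish."""
    mine = IPNS[my_ipns]
    merged = dict(mine.volumes)
    # Look up the IPNS names for the topic and fetch each manifest.
    for name in REGISTRY[topic]:
        for volume_id, chapters in IPNS[name].volumes.items():
            # Adopt any sealed volume that we do not hold yet.
            merged.setdefault(volume_id, chapters)
    # Add our own newly sealed volumes.
    for volume_id, chapters in new_volumes.items():
        merged.setdefault(volume_id, chapters)
    # Publish the updated manifest under our own IPNS name.
    updated = Manifest(mine.publisher, merged)
    IPNS[my_ipns] = updated
    return updated


result = sync("fourbyte", "ipns-samczsun", {963_000: {"0x0": "cid-c"}})
print(sorted(result.volumes))
```

Because sealed volumes never change, adopting a peer's volume is just a matter of copying its CIDs. No negotiation between publishers is needed.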

Thus, separate parties could build the same database in a distributed way. They can provide the data via a free and performant API, but a user can also just use the manifests to get the latest data. Users can also pin.

The Formats

Each database type could have an agreed upon format for the values in key-value pairs.

This is a public declaration / specification where one states the format chosen, for example:

A JSON format
A plain string
An SSZ container

Interested parties could all agree that the format for a key-value pair is the following JSON:

{
    "key": "hex_xyz",
    "value": "readable_text_xyz"
}
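
As a rough sketch of how such an agreed format might be enforced, the Python below encodes and decodes one entry. The function names and validation rules are assumptions made for illustration and are not part of any published specification.

```python
import json


def encode_entry(key: str, value: str) -> str:
    """Serialise one key-value pair in the agreed JSON shape."""
    if not key.startswith("0x"):
        raise ValueError("key must be a 0x-prefixed hex string")
    bytes.fromhex(key[2:])  # raises ValueError if the key is not valid hex
    # sort_keys gives every publisher the same bytes for the same entry.
    return json.dumps({"key": key, "value": value}, sort_keys=True)


def decode_entry(line: str) -> tuple[str, str]:
    """Parse one entry and reject anything outside the agreed shape."""
    obj = json.loads(line)
    if set(obj) != {"key", "value"}:
        raise ValueError("entry must contain exactly 'key' and 'value'")
    return obj["key"], obj["value"]


encoded = encode_entry("0xa9059cbb", "transfer(address,uint256)")
print(encoded)
print(decode_entry(encoded))
```

Deterministic serialisation matters here because content identifiers are derived from bytes: two parties that encode the same entry differently would produce different CIDs for the same data.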

The Sealing

A database that seals old data and never touches it can share pieces with peers.

This is a public declaration / specification where one states the rules to decide how to group new key-value pairs:

When a key-value pair does not fit in the current Volume and must go in a new Volume
Which Chapter the key-value pair fits in

For example, someone could publish a specification for fourbyte data:

Fourbyte Volumes MUST contain EXACTLY 1_000 entries
Fourbyte Chapters contain entries whose keys all start with the same hex character.

So if you had 1_200 entries to add to the database, you would know that you can only seal 1_000 of them into a Volume. Additionally, the volumes you publish will have 16 different chapters (0x0… to 0xf…).
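
A minimal Python sketch of these two example rules follows. The `seal` function and the synthetic keys are hypothetical. Only the 1_000-entry volume size and the first-hex-character chapter rule come from the example specification above.

```python
from collections import defaultdict

VOLUME_SIZE = 1_000  # "Fourbyte Volumes MUST contain EXACTLY 1_000 entries"


def seal(entries: list[tuple[str, str]]):
    """Split entries into full volumes; return (sealed_volumes, leftover)."""
    full = len(entries) // VOLUME_SIZE * VOLUME_SIZE
    volumes = []
    for start in range(0, full, VOLUME_SIZE):
        chapters = defaultdict(list)
        for key, value in entries[start:start + VOLUME_SIZE]:
            # Chapter id is "0x" plus the first hex character of the key.
            chapters[key[:3]].append((key, value))
        volumes.append(dict(chapters))
    # Entries that do not fill a volume stay unsealed until more arrive.
    return volumes, entries[full:]


# Synthetic keys whose first hex character cycles through 0..f.
entries = [(f"0x{i % 16:x}{i:07x}", f"method_{i}()") for i in range(1_200)]
volumes, pending = seal(entries)
print(len(volumes), len(volumes[0]), len(pending))
```

With 1_200 entries, one volume of 16 chapters is sealed and 200 entries wait for the next volume.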

So if 4byte.directory currently has 961_200 entries, they know that at 962_000 they will publish a manifest containing all existing CIDs plus:

- volume 961_000
- volume 962_000 (4byte.directory's latest addition)
    - chapter 0x0 CID xyz
    - chapter 0x1 CID xyz
    - ...
    - chapter 0xf CID xyz

That way, other database maintainers can see your published update and build the next volume on top of your latest volume.

If Samczsun incorporates 962_000 and has an additional 1_000 entries, they could publish a manifest containing 962_000 and 963_000:

- volume 962_000 (first published at 4byte.directory's IPNS)
- volume 963_000 (published at Samczsun's IPNS)
    - chapter 0x0 CID xyz
    - chapter 0x1 CID xyz
    - ...
    - chapter 0xf CID xyz

The Transformer

To make this process smooth, Transformer software could be written to produce Volumes/Chapters and Manifests of the right format.

For example, a “TODD” format, where TODD is a Time-Ordered Distributable Database:

https://github.com/perama-v/TODD

This could exist as a local tool that lets a user participate in creating a distributable database of any arbitrary sort:

Any DB in -> TODD out

TODD + new data -> Updated TODD out

This could re-use common functions (manifest creation) and only require a custom adapter for each database type (ABIs, signatures, tags).

The Requirements

To be Transform-able, one needs to outline a pre- and post-TODD format for the new data.

For example, imagine you have neat TODD pieces (volumes and chapters). Then you have a .csv list of new data you want to add. Perhaps it will become the next 3 volumes in the database. The transformer will accept data in a specific input format, which could be different for each database (binary files, .tsv, .csv).

All that is required is that the input data have a parser that can loop over all the keys in the data to be incorporated.

.csv -> write transformer parser -> parse -> todd

.tsv -> write transformer parser -> parse -> todd

.bin -> write transformer parser -> parse -> todd
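
The parser-per-format idea might look like the Python sketch below. The `PARSERS` table and the `transform` function are hypothetical names, not part of TODD, and the final step that writes volumes and manifests is left out.

```python
import csv
import io
from typing import Callable, Iterator

Pair = tuple[str, str]


def parse_delimited(text: str, delimiter: str) -> Iterator[Pair]:
    """Loop over the rows of delimited input and yield key-value pairs."""
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(row) == 2:
            yield row[0], row[1]


# One small parser per input format; everything after parsing is shared.
PARSERS: dict[str, Callable[[str], Iterator[Pair]]] = {
    ".csv": lambda text: parse_delimited(text, ","),
    ".tsv": lambda text: parse_delimited(text, "\t"),
}


def transform(text: str, extension: str) -> list[Pair]:
    """Common step: whichever parser ran, the output is the same pair list."""
    return list(PARSERS[extension](text))


csv_pairs = transform('0xa9059cbb,"transfer(address,uint256)"\n', ".csv")
tsv_pairs = transform("0xa9059cbb\ttransfer(address,uint256)\n", ".tsv")
print(csv_pairs == tsv_pairs)
```

Supporting a new input format then means adding one entry to the parser table, while sealing and manifest creation stay untouched.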

The idea is that you want people to be able to create data in whatever format is convenient for them. Perhaps this format is different between creators - it does not matter because the transformer outputs TODD format which both parties agree on.

The Contract

In a specification, say for fourbyte signatures, one would declare an arbitrary topic, thus making it clear under which term (“fourbyte”, “4byte”, “event-signatures”, etc.) publishing IPNS names should be registered.

See here for more rationale on IPNS name based broadcasting: https://github.com/perama-v/GAMB/issues/1

The Hoovers

Many benevolent actors want to create databases as public goods. This can result in duplication of effort and vampire attacks on users.

Databases that grow do benefit from easy-to-use APIs/UIs that let users contribute small pieces of data (e.g., publish an ABI or event signature by API). So this is not a claim that “Free API providers are the problem”. Updating databases is a critical role.

The problem as I see it is the Molochian failure to coordinate.

The solution is Schelling points for database formats and database distribution mechanisms.

Data can be transparent, content-addressable pieces that can be periodically and permanently shared back to the user, who then acts as provider.