Data queries in DWN?

Being able to associate multiple records and provide query capabilities over those records is essential to a useful datastore. In relational databases we have the concepts of primary key, foreign key and indexes. However for object stores like DWN (which is more akin to Amazon S3 IMO), developers often resort to creative uses of the object key and bucket key in order meet their search pattern; adding a separate indexing DB/service to meet their needs is also common.

Currently the DWN design is not to have a full-blown query language support given the expectation that a lot the data would be encrypted. Instead, DWN will have a more prescribed query capability against the descriptor properties, again, more akin to Amazon S3 and Azure Blobs. A list of queryable properties against the Collections records conceptually looks like this:

{
    recipient: 'did:example:alice',
    protocol: 'tbdex';
    schema: 'tbdex-ask;
    contextId: 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa';
    recordId: 'bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb';
    parentId: 'cccccccc-cccc-cccc-cccc-cccccccccccc';
    dateCreated: 123456789
    dataFormat: 'application/json';
  }

contextId is like the Amazon S3 bucket key; and
recordId is like the Amazon object key, except currently it is even more strict because it needs to be a UUID

This begs the question: should DWN provide “better” or more flexible indexing/search capabilities than say the Amazon S3 to support Web5 developers wanting to perform more targeted, efficient queries over their records in DWN?

e.g. “I want to search all the songs in my DWN written by artist XYZ".

Potential answers:

  1. Out of scope of a DWN, add your own indexing over your records in addition to DWN for these “advance” uses. Note this makes DWN “more-or-less” on parity in terms of query capability with Amazon S3 or Azure Blobs, provided that:

    1. we consider allowing free-formed, developer-defined recordId.
    2. we provide basic prefix and range query capability over these properties
  2. In addition to option 1, allow developers to somehow index custom properties, possible approaches comes in mind:

    1. literally having predefined indexed properties name index1, index2, index3. This is ugly, but would likely solve a majority of DWN use cases and is simple to implement. or
    2. a more elegant implementation that allows up to say X number of properties defined in the descriptor with automatic indexing.

Thoughts?

2 Likes

Very interesting!

Definitely having at least basic prefix and range query capability would be a huge win and put this on parity with s3

People can load a parquet file onto their dwn and use that as a datastore and run queries on this object, Read and Write Parquet file from Amazon S3 - Spark by {Examples}

Interesting! @daniel where do you envision this is best handled - in DWN itself or as a layer above? Does the encrypted nature of the data necessarily put querying/results in the scope of DWN itself?

S,
ALR

Allowing unbounded plaintext fields for such annotations is a huge privacy risk, as devs are likely to abuse it by overloading it with PII/data that divulges far more than the user intends or they realize. We should focus on nailing the simpler indexing by the few exposed metadata fields first, then move to assess other additions. Initially, apps/services that acquire permissions to data can pull in data sets for local indexing, which can be as rich, powerful, and differentiated as they choose.

developers have shown that building on top of just s3 like blobstores is immensely powerful, given there is (sometimes) some ability to index/list contents, or address by a key which is specified as needed (sometimes with creative patterns). I could see developers getting carried away with sets of annotation type fields and accidentally putting information in there that should not have been, but for recordId it seems more focussed and less likely to happen.

Cool thoughts.

I’m toying with a few web :five: app ideas at the moment, one of which is a “union voting” app.

It is an app+protocol where there is a timed permission access to a specific Referendum bucket something like contextId: “referendum/{id}”

I want to give access to users who can:

  • Produce a specific Credential
  • Have not voted before

If these points are satisfied, the user should be able to create a “vote” object inside the bucket referendums/{id}/votes, rather than updating the referendum object itself referendum/{id} . Bonus points if I can cut off access after a certain timestamp. It seems like Daniel has already produced ideas for throttling, this would be handy here to if you can define a such like limiter.for(did).filter("CreateObject", "referendums/{id}/votes").times(1).forever()

This could let voters verify their votes were counted but not see others votes.

The only indexing required would be a “ListObjects” type permission on a specific hierarchical contextID key such as referendum/{id}/votes. This is slightly different to how it works now, where you must filter by Schema name.

Permissions on particular keys for writes as well as reads.

The Hospitality/Travel SIG has some interesting examples too, where a DWN controller would want to allow read/write access to a hotel or airline for a certain amount of time and provide an index specific to that vendor.

:v:

Personal take: I believe this is out of scope for DWNs. How you index is an implementation detail above the DWN and you might want to index in a particular way for a specific use case. Maybe the libs can provide a base indexer as a starting point, but ultimately I think it should be up to the dev if they want to implement another layer of custom indexing.