Sync Protocol
Blocks and Commits
Blocks are the smallest unit of data that transits the network and is stored in Brokers.
A block is an encrypted piece of data. It uses convergent encryption, which enables content-addressing and deduplication of identical content without revealing that content.
The size of a block is limited to 1 MB. When a blob of data (which we call an Object) is bigger than that, it is sliced into several blocks and a tree of blocks is created. The root block is used to identify the Object.
A block’s ID is nothing more than the hash of its encrypted data. That’s how we do content addressing.
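To make the idea concrete, here is a minimal sketch of convergent encryption and content addressing. Everything in it is an assumption made for illustration: the function names, the SHA-256 hash, the toy XOR keystream (which is not secure), and the two-level parent/leaf layout with its `L`/`P` prefix are hypothetical and are not taken from the NextGraph implementation.

```python
import hashlib

MAX_BLOCK_SIZE = 1024 * 1024  # the 1 MB limit mentioned above


def _keystream(key: bytes, length: int) -> bytes:
    # Toy keystream: SHA-256 in counter mode. For illustration only, NOT secure.
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def encrypt_block(plaintext: bytes):
    """Convergent encryption: the key is derived from the plaintext itself,
    so identical plaintexts always produce identical ciphertexts."""
    if len(plaintext) > MAX_BLOCK_SIZE:
        raise ValueError("a block is limited to 1 MB")
    key = hashlib.sha256(b"key:" + plaintext).digest()
    stream = _keystream(key, len(plaintext))
    return key, bytes(p ^ s for p, s in zip(plaintext, stream))


def decrypt_block(key: bytes, ciphertext: bytes) -> bytes:
    stream = _keystream(key, len(ciphertext))
    return bytes(c ^ s for c, s in zip(ciphertext, stream))


def block_id(ciphertext: bytes) -> str:
    """The block ID is the hash of the encrypted data."""
    return hashlib.sha256(ciphertext).hexdigest()


def store_object(data: bytes, store: dict, chunk_size: int = MAX_BLOCK_SIZE - 1):
    """Slice an Object into blocks; return (root block ID, root key)."""
    def put(plaintext: bytes):
        key, ciphertext = encrypt_block(plaintext)
        bid = block_id(ciphertext)
        store[bid] = ciphertext  # identical blocks collapse into one entry
        return bid, key

    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    leaves = [put(b"L" + chunk) for chunk in chunks]
    if len(leaves) == 1:
        return leaves[0]
    # Parent block: lists the ID and key of every child (hypothetical layout).
    index = "\n".join(f"{bid}:{key.hex()}" for bid, key in leaves)
    return put(b"P" + index.encode())


def load_object(root_id: str, root_key: bytes, store: dict) -> bytes:
    plaintext = decrypt_block(root_key, store[root_id])
    if plaintext[:1] == b"L":
        return plaintext[1:]
    parts = []
    for line in plaintext[1:].decode().split("\n"):
        bid, key_hex = line.split(":")
        parts.append(decrypt_block(bytes.fromhex(key_hex), store[bid])[1:])
    return b"".join(parts)
```

Because the key depends only on the content, two replicas that store the same chunk end up with the same block ID, and the store (a plain dict standing in for a Broker) keeps a single copy. The store never sees the keys, so it cannot read the content.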
Objects can contain different types of data, namely:
- Binary files (used to store images and other multimedia files, or any text or binary file)
- Commits, which have 3 parts, stored in separate objects: the Header, Content, and Body
- Other internal objects like Quorum definitions, Signatures, Certificates and RefreshCaps
Commits are what is sent in the events of the pub/sub, and they are the objects that constitute the core of the protocol. All the information and content of documents is in fact encoded in commits.
Commits are organized in a DAG (directed acyclic graph) that has one root and, at any given time, can have several current HEADS. This makes the DAG a semi-lattice, which temporarily becomes a lattice when there is only one HEAD that has merged all the forks. This happens, for example, when the total-order transaction mechanism is used.
When new content is created, a new commit containing the modifications is added to the DAG. All the current HEADS known by the local replica before this insertion are referenced in the new commit as ACKS, because the new commit acknowledges the previous current heads and all the causal past that comes with them.
This new commit is now the current head at the local replica, and will be sent to the other replicas via the pub-sub.
It can happen that other editors make concurrent modifications. In this case, they will also publish a commit with a causal past (ACKS) that is similar or identical to that of the commit we just published.
This will lead to a temporary “fork” in the DAG, and after the replicas have finished syncing, they will all have 2 current heads, one for each of the concurrent commits.
The next commit (made by whoever modifies the document next) will “merge” the fork, because it references the 2 heads as ACKS (direct causal past).
And so on. This way, the DAG automatically merges itself, without any conflict, and the branching that might occur, automatically collapses itself back to one single “branch”.
The forking/merging is automatic, and any conflict that could emerge because of concurrent modifications is handled by the respective CRDT format that represents the data.
The documents of NextGraph have several CRDT formats that can coexist.
In order to have a well-formed DAG, only the unique root commit (a singleton) will be without ACKS. All the subsequent commits will have to have at least one ACK that links to a commit in the direct causal past.
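The fork and merge behaviour described above can be sketched in a few lines. This is only a model under stated assumptions: the `CommitDag` class, its method names and the string commit IDs are hypothetical, and real commits carry encrypted content and signatures that are left out here.

```python
class CommitDag:
    """Minimal model of one replica's view of a branch."""

    def __init__(self):
        self.acks = {}      # commit ID -> tuple of acknowledged commit IDs
        self.heads = set()  # current HEADS at this replica

    def add(self, commit_id, acks=()):
        """Insert a commit (local or received from another replica)."""
        acks = tuple(acks)
        if commit_id in self.acks:
            return  # already known
        if not acks and self.acks:
            raise ValueError("only the unique root commit may have no ACKS")
        missing = [a for a in acks if a not in self.acks]
        if missing:
            raise ValueError(f"unknown causal past: {missing}")
        self.acks[commit_id] = acks
        self.heads -= set(acks)   # acknowledged commits are no longer heads
        self.heads.add(commit_id)

    def commit(self, commit_id):
        """Create a local commit that acknowledges all the current heads."""
        self.add(commit_id, sorted(self.heads))
        return self.acks[commit_id]


# Two replicas start from the same history.
a, b = CommitDag(), CommitDag()
for replica in (a, b):
    replica.add("root")
    replica.add("c1", ["root"])

# Concurrent modifications: each replica publishes its own commit.
a.commit("a1")
b.commit("b1")

# After syncing, both replicas see a fork with 2 heads.
a.add("b1", b.acks["b1"])
b.add("a1", a.acks["a1"])

# The next commit references both heads as ACKS and merges the fork.
merge_acks = a.commit("m")
```

After the last line, replica `a` is back to a single head. Resolving what the two concurrent commits mean for the content is not the DAG's job; that part belongs to the CRDT.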
A higher resolution version of this diagram can be found here in PNG format or in SVG format.
Branches and Repos
The DAG of commits that we just described represents, in NextGraph, one Branch of a Repo.
A Repo (repository) is a unit that regroups one or several branches of content, together with a set of users known as members. It has a unique ID (a public key) that is immutable (will never change). At a higher level, a Repo is called a Document. But at the level that we are dealing with now, let’s just call it a repo.
When a repo is created, it comes with 2 branches by default:
- the root branch, which is used to store all the members’ information, their permissions, and the list of branches, and which controls the epochs. It does not hold content. Its branchID cannot change because it is, in fact, the same ID as the RepoID itself.
- the main branch, which is a transactional branch (transactional = holds content) and which will be the default branch if no other branch is specified when a request is made to access the content. It is possible to change the main branch and point it to another branch in the repo.
Each branch also has a unique ID, that is immutable.
When a new commit is published, it is always done inside a branch.
A branch has a topicID associated with it, and when the commit leaves the replica and goes to the broker, it is published on that topicID (and the broker doesn’t even see or know the branchID).
It is possible to renew the topicID during the lifetime of a branch, even several times.
This renewal mechanism is used when the capabilities of the branch need to be refreshed (for read access, when we want to remove read access from some user).
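As a rough sketch of the relation between a branch and its topic, assuming hypothetical `Branch` and `Broker` classes and random hex strings as topic IDs (the real renewal also involves capabilities and keys, which are not modelled here):

```python
import secrets


class Branch:
    def __init__(self, branch_id: str):
        self.branch_id = branch_id             # never changes
        self.topic_id = secrets.token_hex(16)  # can be renewed

    def renew_topic(self) -> str:
        """Switch to a fresh topic; later commits are published there."""
        self.topic_id = secrets.token_hex(16)
        return self.topic_id


class Broker:
    """Only ever sees topic IDs and opaque (encrypted) events."""

    def __init__(self):
        self.topics = {}  # topic ID -> list of events

    def publish(self, topic_id: str, event: bytes):
        self.topics.setdefault(topic_id, []).append(event)

    def events(self, topic_id: str):
        return list(self.topics.get(topic_id, []))


def publish_commit(broker: Broker, branch: Branch, encrypted_commit: bytes):
    # The branch ID stays on the replica; only the topic ID reaches the broker.
    broker.publish(branch.topic_id, encrypted_commit)
```

A subscriber who only knows the old topic ID keeps whatever was published there, but receives nothing that is published after `renew_topic()` is called.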
Write access is not controlled per branch, but more generally at the repo level. It is not possible to give write permission to only one specific branch. When a member is given write permission, it applies to all the branches of the repo at once. The same holds when write permission is revoked: it is revoked for all the branches of the repo at once.
It is indeed important that permissions are common to all branches, because we will now see that branches can be merged into one another. When a merge happens, we consider that all the commits of the branches are valid and were already verified at the moment each commit was added. We do not want to have to re-verify a whole branch before it is merged. What was already verified and accepted is immutably part of the repo. If we had a permission system with different permissions for each branch, there would be cases where some commits in one branch cannot be merged into another branch because the permissions are incompatible. In order to prevent this, and also to simplify an already very complex design, we restricted permission management to the repo level, unlike the previous design of LoFi.Re.
The permissions, as said earlier, are stored in the root branch.
All the members of a Repo can see the list of other members and their permissions. This is important, because they will need these ACLs to verify the write permission of each commit’s author.
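The repo-level permission model can be pictured with the following sketch. The `Repo` class, its method names and the `"<repo_id>:main"` ID format are assumptions made for the example; in NextGraph the ACLs live in commits of the root branch rather than in a Python set.

```python
class Repo:
    def __init__(self, repo_id: str):
        self.repo_id = repo_id
        self.root_branch = repo_id            # root branch ID is the RepoID itself
        self.main_branch = f"{repo_id}:main"  # hypothetical ID format
        self.branches = {self.root_branch, self.main_branch}
        self.writers = set()                  # ACL, stored in the root branch

    def add_branch(self, branch_id: str):
        self.branches.add(branch_id)

    def set_main(self, branch_id: str):
        """Point the main branch to another branch of the repo."""
        if branch_id not in self.branches:
            raise KeyError(branch_id)
        self.main_branch = branch_id

    def grant_write(self, user: str):
        self.writers.add(user)

    def revoke_write(self, user: str):
        self.writers.discard(user)

    def can_write(self, user: str, branch_id: str) -> bool:
        """Write permission is decided per repo; the branch only has to exist."""
        if branch_id not in self.branches:
            raise KeyError(branch_id)
        return user in self.writers


repo = Repo("repo-1")
repo.add_branch("working")
repo.grant_write("alice")
alice_before = [repo.can_write("alice", b) for b in sorted(repo.branches)]
repo.revoke_write("alice")
alice_after = [repo.can_write("alice", b) for b in sorted(repo.branches)]
```

Granting or revoking touches a single ACL, so there is no way to end up with a member who can write to one branch but not to another.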
Coming back to our branches, their purpose is twofold:
- branches can be used to “fork” a DAG, for example that of the main branch, into a new branch that in Git terminology would be called a “working branch”, where some parallel work can be done on the same document. Once this work is completed, it is possible to “merge back” this working branch into the main branch.
- if, on the contrary, the working branch should now become the main branch without a merge (because some concurrent modifications happened in the main branch, and we want to discard them and use the working branch as the main branch instead), then there is no need to merge. Instead, we point the main branch towards the working branch.
- the second use case for branches is to have 2 or more completely different contents for the same document. For example, the main branch contains some text document, and another branch contains some JSON data. Or the main branch contains some RDF triples, and another branch contains some extra RDF triples that have different read permissions.
Indeed, as we already explained, each branch can have different read permissions (but all the branches in a repo share the same write permissions).
This is due to the fact that a read permission is in fact a cryptographic capability (ReadCap) that contains a pointer towards a branch. It is possible to share this read capability with someone, and therefore give them read access to a specific branch of the repo, without letting them know of any other branch of that repo. This way, the reader will be confined to reading that branch, and will not even be able to access the root branch of the repo. The list of editors will not be accessible to them, nor the list of other branches. Sharing a BranchReadCap only gives access to one branch. This ReadCap also includes all the information needed to subscribe to that branch (to the corresponding Topic, to be precise). So when we share a ReadCap, we share the content of one and only one branch, together with the capability to subscribe to future updates on that branch. It is also possible to share all the branches at once, by sharing the ReadCap of the root branch, but that’s something else.
That’s very handy, if we want to separate a Document into several parts that will have different read access.
Let’s say I have a Document that is my personal profile description. It contains my pseudonym, full name, date of birth, postal address, email address, phone number, short biography, profile picture, etc.
Now let’s imagine that, for some reasons related to my privacy, I do not always want to share my postal address and phone number with everyone. Sometimes I want to share everything except those two fields.
I could create two different documents: one with all the info, and one with the reduced profile.
But that would be cumbersome, as every time I need to update my bio, for example, I would have to copy and paste it into both Documents.
The solution is to create only one profile Document, and to put the sensitive information (postal address and phone number) in a separate branch.
Both branches are updatable. If I modify my bio, all the users who subscribed to the main branch will receive that update. The same goes for my phone number or postal address: if I update the other branch that contains the phone number and address (let’s call it the privateProfile branch), then all my close friends with whom I shared that branch will see the update.
And I can even include a link to the main branch from within the privateProfile branch, so that those trusted people also have access to the main branch, without me needing to share both branches’ ReadCaps with them.
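The profile example can be sketched as follows. This is a simplified model: the `ReadCap` fields, the `Repo.read` method and the way a link is stored are hypothetical, and the real ReadCap is a cryptographic capability rather than a shared secret compared in memory.

```python
import secrets
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ReadCap:
    branch_id: str  # pointer towards exactly one branch
    read_key: str   # hypothetical stand-in for the cryptographic material


@dataclass
class Branch:
    read_key: str
    content: dict = field(default_factory=dict)
    links: list = field(default_factory=list)  # ReadCaps embedded in the content


class Repo:
    def __init__(self):
        self.branches = {}

    def create_branch(self, branch_id: str) -> ReadCap:
        key = secrets.token_hex(16)
        self.branches[branch_id] = Branch(read_key=key)
        return ReadCap(branch_id, key)

    def read(self, cap: ReadCap) -> dict:
        """Return what a holder of this ReadCap can see, following links."""
        branch = self.branches[cap.branch_id]
        if not secrets.compare_digest(branch.read_key, cap.read_key):
            raise PermissionError("invalid ReadCap")
        visible = dict(branch.content)
        for link in branch.links:
            visible.update(self.read(link))
        return visible


profile = Repo()
main_cap = profile.create_branch("main")
private_cap = profile.create_branch("privateProfile")

profile.branches["main"].content.update(pseudonym="nio", bio="short biography")
profile.branches["privateProfile"].content.update(phone="555-0100", address="1 Example St")
# The private branch links to the main branch, so one ReadCap is enough for friends.
profile.branches["privateProfile"].links.append(main_cap)

public_view = profile.read(main_cap)
friends_view = profile.read(private_cap)
```

Someone holding only `main_cap` never learns that `privateProfile` exists, while close friends holding `private_cap` reach both branches through the embedded link.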
If at some point in the future I want to merge those two branches into one, I won’t be able to, because in order to merge two branches, they need to share a common ancestor (one branch has to be a fork of the other).
But here, those 2 branches are completely separate from one another. The only thing they share is that they belong to the same Repo, and the root commit of each DAG has zero ancestors. Those 2 DAGs are unrelated to one another, so we cannot merge them.
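The common-ancestor condition can be checked by walking the ACKS of each branch head and intersecting the two causal pasts. The sketch below assumes a plain dict from commit ID to its ACKS; the function names and commit IDs are made up for the example.

```python
def causal_past(acks: dict, head: str) -> set:
    """All commits reachable from `head` through ACKS, including `head`."""
    seen, stack = set(), [head]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(acks[commit])
    return seen


def can_merge(acks: dict, head_a: str, head_b: str) -> bool:
    """Two branches can only be merged if they share a common ancestor."""
    return bool(causal_past(acks, head_a) & causal_past(acks, head_b))


acks = {
    # main branch
    "m0": [], "m1": ["m0"], "m2": ["m1"],
    # working branch, forked from the main branch at m1
    "w1": ["m1"], "w2": ["w1"],
    # privateProfile: a standalone branch with its own root commit
    "p0": [], "p1": ["p0"],
}

fork_mergeable = can_merge(acks, "m2", "w2")
standalone_mergeable = can_merge(acks, "m2", "p1")
```

Here the working branch shares `m0` and `m1` with the main branch, so the merge is possible, while the standalone branch has nothing in common with it.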
Another example of how we can use branches to do cool stuff is commenting on or annotating someone else’s content. Commenting is a kind of editing, as it adds content. But we don’t want to have to invite those commentators as editors of the document they want to comment on.
Instead, the commentator will create a standalone branch somewhere in their own protected store. They are free to proceed as they want on that: they can create a special Document on their side whose sole purpose is to hold all the branches used for comments on a specific target Document, or they can use fewer Documents and have one general-purpose Document in their protected store that is always used to create branches for commenting, regardless of the target Document that is commented upon. What matters is that they are the only editor of that Document, and that they write one comment per branch. The branch subscription mechanism will let them update or fix typos in that specific comment later on. They can also delete the branch at any time, in order to delete their own comment.
Once they have created that branch and inserted some content in it (the comment itself), they will send a link (a DID cap) to the original Document they want to comment upon (each Document has an inbox, which is used in this case to drop the link). A comment can reference a previous comment, or quote some part of the document (annotation). Thanks to RDF, this is easy to do.
The owner of the Document that receives this link to a comment can moderate it: accept it, reject it, or remove it after accepting it. If accepted, the link (DID cap) is added to the special branch for comments that every document has by default (more on that below). Any reader of the document who has subscribed to this branch will see the new comment.
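The moderation flow can be summarised with a small sketch. The `Document` class, its method names and the placeholder link strings are assumptions for illustration only; the actual inbox and comments branch are built from commits and capabilities.

```python
class Document:
    def __init__(self, owner: str):
        self.owner = owner
        self.inbox = []            # links dropped by commentators
        self.comments_branch = []  # accepted links, visible to subscribers

    def drop_link(self, link: str):
        """Anyone can drop a link to their comment branch in the inbox."""
        self.inbox.append(link)

    def moderate(self, actor: str, link: str, accept: bool):
        """The owner accepts or rejects a pending link."""
        if actor != self.owner:
            raise PermissionError("only the owner can moderate")
        self.inbox.remove(link)
        if accept:
            self.comments_branch.append(link)

    def remove_comment(self, actor: str, link: str):
        """The owner can still remove a comment after accepting it."""
        if actor != self.owner:
            raise PermissionError("only the owner can moderate")
        self.comments_branch.remove(link)


doc = Document(owner="alice")
doc.drop_link("link-to-bob-comment")    # placeholder for a DID cap
doc.drop_link("link-to-carol-comment")  # placeholder for a DID cap
doc.moderate("alice", "link-to-bob-comment", accept=True)
doc.moderate("alice", "link-to-carol-comment", accept=False)
```

The comment itself stays in the commentator's own branch. The target document only stores the link, so the commentator can keep editing or deleting the comment without ever being an editor of the document.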
So, to recap:
- a branch has specific read permissions, but shares write permissions with all the other branches in the repo.
- a branch is the unit of data that can be subscribed to.
- I can put what I want in a branch.
- I can also fork a branch into another branch, and then merge that fork back into the original branch (or into any other branch that shares a common ancestor).
- those forks can be used to store some specific revisions of the document. Then, by using the branchId, it is possible to refer to that specific revision.
- a branch can also be given a name, like “rewriting_paragraph_B”.
- any given commit has an ID, and that commit ID can also be used to refer to a specific revision, which in this case is just the state of the document at that very specific commit. Commits can also be given names like v0_1_0 (equivalent to tags in Git), and those names are pointers that can be updated. So one can share the name and update the pointer later on.
- standalone branches can be used to separate different segments of data that need different read permissions.
- ReadCaps can be refreshed in order to remove read access to some branch (but the historical data that the former readers used to have access to will always remain visible to them, especially because everything is local-first, so they surely have a local copy of that historical data. What they won’t see are the new updates).
- we use the terms “DID cap”, “ReadCap”, “URI”, “link” or “Nuri” interchangeably in this document. They all mean the same.
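The point about named commits being updatable pointers can be illustrated with a tiny sketch. The `RevisionNames` class and the commit IDs are hypothetical; how names are actually stored in a repo is not described here.

```python
class RevisionNames:
    """Names are pointers to commit IDs, and the pointers can be moved."""

    def __init__(self):
        self._pointers = {}

    def set(self, name: str, commit_id: str):
        self._pointers[name] = commit_id

    def resolve(self, ref: str) -> str:
        """Return the commit ID behind a name, or `ref` itself if it is not a name."""
        return self._pointers.get(ref, ref)


names = RevisionNames()
names.set("v0_1_0", "commit-3")
first = names.resolve("v0_1_0")
names.set("v0_1_0", "commit-7")  # same shared name, updated pointer
second = names.resolve("v0_1_0")
```

Whoever was given the name `v0_1_0` follows the pointer wherever it moves, whereas a raw commit ID always designates the same revision.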
It is also possible to fork a whole repo if ownership and permissions need to be changed (similar to the “fork me on GitHub” feature), and there is then a mechanism for “pull requests” in order to merge that forked repo back into the original repo. But it doesn’t work like the merging of branches, as each commit has to be checked again separately and added to the DAG again, using the identity of a user that has write permission in the target repo. Let’s leave that for now, as it is not coded yet, and not urgent.
The root branch is a bit complex and has all kinds of system commits to handle the internals of permissions etc. We will not dive into that right now. There are also some other hidden system branches (called Store, User, Overlay, Chat, etc.) that contain internal data used by the system; given their reserved names, you can probably imagine what they do. But again, let’s keep that for later.
What matters for now is that any transactional branch contains commits that modify the content of the branch, which is a revision of the document.
Those commits are encrypted and sent as events in the pub/sub.
When a commit arrives on a replica, the Verifier is in charge of verifying the integrity of the commit, and of the branches and repo in general. To do so, the Verifier needs to read the ACLs. It will also verify some signatures and do some checks on the DAG.
If something goes wrong, the commit is rejected and discarded. Its content is not passed to the application level.
Eventually, all the replicas have a local set of commits for a branch, and they need to read them and process them once in order to build the materialized state of the doc. That’s the job of the Verifier.
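A very reduced picture of that job could look like the sketch below. It rests on several assumptions: commits are plain dicts, the only check is whether the author is in the writers ACL, a commit whose causal past was rejected or is missing is rejected too, and a grow-only set stands in for the CRDT. Signature checks and the real permission history are left out.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def verify_and_materialize(commits: dict, writers: set):
    """commits maps commit ID -> {"author": str, "acks": list, "op": str}.
    Returns (materialized state, set of rejected commit IDs)."""
    graph = {cid: set(c["acks"]) for cid, c in commits.items()}
    accepted, rejected, state = set(), set(), set()
    # static_order() yields every commit after the commits it acknowledges,
    # so each commit is processed exactly once, in causal order.
    for cid in TopologicalSorter(graph).static_order():
        commit = commits.get(cid)
        if commit is None:
            continue  # referenced in some ACKS but not received
        authorized = commit["author"] in writers
        past_is_valid = all(ack in accepted for ack in commit["acks"])
        if authorized and past_is_valid:
            accepted.add(cid)
            state.add(commit["op"])  # grow-only set: order of concurrent ops is irrelevant
        else:
            rejected.add(cid)        # content is never passed to the application
    return state, rejected


commits = {
    "root": {"author": "alice", "acks": [], "op": "title"},
    "c1": {"author": "alice", "acks": ["root"], "op": "bio"},
    "c2": {"author": "mallory", "acks": ["root"], "op": "spam"},
    "c3": {"author": "alice", "acks": ["c2"], "op": "reply"},
    "c4": {"author": "alice", "acks": ["missing"], "op": "orphan"},
}
state, rejected = verify_and_materialize(commits, writers={"alice"})
```

In this run only the operations of `root` and `c1` reach the materialized state. `c2` fails the ACL check, and `c3` and `c4` are dropped because their causal past is not valid on this replica.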