Metamuse Episode 56 — May 12, 2022

Sync

The foundational technology for Muse 2 is local-first sync, which draws from over a decade of computer science research on CRDTs. Mark, Adam Wiggins, and Adam Wulf get technical to describe the Muse sync technology architecture in detail. Topics include the difference between transactional, blob, and ephemeral data; the “atoms” concept inspired by Datomic; Protocol Buffers; and the user’s data as a bag of edits. Plus: why sync is a powerful substrate for end-user programming.

Episode notes

Transcript

00:00:00 - Speaker 1: But this totally changes how the data is persisted, and I think that’s important because the only way you get good results on sync systems, especially when you’re talking about offline versus online and partially online, it has to be the one system that you use all the time. You can’t have some second path that’s like the offline cache or offline mode that never works. It needs to be the one true data synchronization persistence layer.

00:00:29 - Speaker 2: Hello and welcome to Metamuse. Muse is a tool for thought on iPad and Mac, but this podcast isn’t about Muse the product, it’s about Muse the company and the small team behind it. I’m here today with two of my colleagues, Mark McGranaghan.

00:00:43 - Speaker 1: Hey, Adam.

00:00:44 - Speaker 2: And Adam Wulf.

00:00:46 - Speaker 3: Yeah, happy to be here.

00:00:48 - Speaker 2: Now Wulf, you are not at all new to the Muse team, I think you’ve been with us for coming up on 2 years now, but it is your first appearance here on this podcast, a long overdue one I would say. So we’d love to hear a little bit about your background and how you came to the team.

00:01:03 - Speaker 3: Yeah, thanks, it’s exciting. Before Muse, I worked for a number of years with Flexibits on their calendar app, Fantastical, both on the Mac and the iPhone and iPad. Really enjoyed that. At the same time, I was also working on an iPad app called Loose Leaf, which was an open source paper inking app, kind of a note taking app of sorts. Really enjoyed that as well.

00:01:28 - Speaker 2: And Wulf, when we came across your profile, let’s say, I was astonished to see Loose Leaf. It felt to me like it had sort of the same core vision or a lot of the same ideas as Muse, this kind of open-ended scratch pad, multimedia, inking, fluid environment, but I think you started in what, 2013 or something like that, the Apple Pencil didn’t even exist, and you were doing it all yourself and, you know, in a way maybe too early and too much for one person to do, but astonishing to me when I saw the similarity of the vision there.

00:02:03 - Speaker 3: Yeah, thanks. I think the vision really is extremely similar. I really wanted something that felt physical, where you could just quickly and easily get to a new page of paper and just ink, and the app itself got out of your way, and it could just be you and your content, very similar to you sitting at your desk with some pad of paper in front of you. But yeah, I think I started when the iPad 2 was almost released. And so the hardware capabilities at the time were dramatically less, and the engineering problems were exponentially harder as a result of that, and it was definitely too early, but it was a lot of fun at the time.

00:02:42 - Speaker 2: And I think one of the things that came out of that, if I remember correctly, is this open source work you did on ink engines, which is how we came across you. Tell us what you did there.

00:02:52 - Speaker 3: Yeah, there’s a few different libraries I ended up open sourcing from that work.

One was the ink canvas itself, which was the most difficult piece for me. The only way to get high performance ink on the iPad at the time was through OpenGL, which is a very low level, usually 3D, rendering pipeline.

I had no background in that, and so it was extremely difficult to get something up and running with that low level of an architecture.

And so, once I had it, I was excited to open source it and hopefully let other people use it without having to go through the same pain and horror that I did to make it work.

But then one of the other things that was very useful that came out of loose leaf was a clipping algorithm for Bezier curves, which are just fancy ways to define ink strokes, basically, or fancy ways to describe long curvy, self-intersecting lines. And that work has also been extremely important for Muse as well. We use that same library and that same algorithm to implement our eraser and our selection algorithms.

00:04:05 - Speaker 2: And when you’re not deep in the bowels of inking engines, or as we’ll talk about soon, syncing engines, what do you do with your time?

00:04:13 - Speaker 3: Oh, I live up in northwest Houston in Texas with my wife Christie and my daughter Kaylin. And she is in high school now, which is a little terrifying, and learning to drive and we’re starting that whole adventure, so that’s been fun for us. I try and get outside as much as I can. I’ll go backpacking or hiking a little bit. That can be fun, and the Houston summer, it’s rather painful, but the springs and the falls, we have nice weather for outdoors and so.

00:04:42 - Speaker 2: What’s the terrain like in the day trip kind of range for you? Is it deserty? Are there mountainous or at least hilly areas, or is it pretty flat?

00:04:52 - Speaker 3: It is extremely flat and lots and lots of pine trees, and that’s pretty much it. Just pine trees and flat land. Sometimes I’ll drive a few hours north. We have some state parks that are nice and have a bit of variety compared to what’s immediately around Houston, so that’s a good backup plan when I have the time.

00:05:14 - Speaker 2: Flat with a lot of trees sounds surprisingly similar to the immediate vicinity of Berlin. I would not have expected Texas and northern Germany to have that commonality. It gave me a lot of appreciation for the San Francisco Bay Area; while that city didn’t quite suit me, as we’ve discussed in the past, one thing that was quite amazing was the nature nearby, and a lot of that ends up being less the foliage or whatever, but more just elevation change. Elevation change makes hikes interesting and views interesting, and I think itself leads to, yeah, just landscape elements that engage you in a way that flatness does not.

00:05:55 - Speaker 3: Yeah, absolutely. I lived in the Pacific Northwest for a while, and the trees there are enormous, and the amount of green and elevation change there is also enormous. And so when we moved back to Houston, it was a bit of a shock almost to see what I used to think were tall trees in Houston are really not very tall compared to what I lived around up in Portland, Oregon.

00:06:21 - Speaker 2: So our topic today is sync.

Now Muse 2.0 is coming out very soon. We’ve got a launch date May 24th. Feels like tomorrow for our team scrambling to get all the pieces together here, but the biggest investment by far, even though we have the Mac app and text blocks as part of it, the biggest kind of time, resource, energy, life force investment by far has been the local-first syncing engine.

And we’ve spoken before about local-first sync as a philosophy generally in our episode with Martin Kleppmann, but I thought it would be good to get really into the details here now that we have not only built out this whole system, both the client side piece and the server piece, but also that we’ve been running it in, won’t quite call it production, but we’ve been running it for our beta for a few months now, and we have quite a number of people using that, some for pretty serious data sizes, and so we’ve gotten a little glimpse of what it’s like to run a system like this in production. So first, maybe Mark, can you describe a little bit how the responsibilities break down between the two of you on the implementation?

00:07:32 - Speaker 1: Yeah, so I’ve been developing the back end or the server component of our sync system, and Wulf has been developing our iOS client that is the core of the actual app.

00:07:45 - Speaker 2: Yeah, on that side, I kind of think of the client persistence or storage layer as being the back end of the front end. So that is to say it’s in the client, which obviously is a user interface heavy and oriented thing, but then it persists the user data to this persistence layer, which in the past was Core Data, is that right? The kind of standard iOS storage library thing.

00:08:08 - Speaker 3: Yeah, that’s exactly right. Yeah, we used Core Data, which is Apple’s fancy wrapper on top of a SQLite database. And that just stores everything locally on the iPad, like you were saying, so that way the actual interface that people see, that’s what it talks to.

00:08:25 - Speaker 2: And then that persistence layer within the client can talk to this back end that Mark has created. And much more to say about that, I think, but I thought it would be nice to start with a little bit of history here, a little bit of motivation.

I’ll be curious to hear both of your stories, but mine actually goes back to using my smartphone on the U-Bahn, so that’s the subway system here in Berlin, when I was first working with some startups in the city back in, I guess it would have been 2014, so 8 years ago. I had this experience of using different apps and seeing how they handled both the offline state but actually the kind of unstable state, because you have this thing where the train car goes in and out of stations, and when you’re in the station, you usually have reception, weak reception, and when you leave the station that fades off to you being essentially fully offline, and so you’re in this kind of unreliable network state all the time.

And two that I remember really well, because they were really dramatic: one was Pocket, which is the read-later tool I was using at the time, and it handled that state really well. If it couldn’t load an article, it would just say you’re offline, you need to come back later, but the things it had saved, you could just read. The other one I was using was the Facebook mobile app, and there I was amazed how many errors and weird spinners you got; you’d go to load a thing and it would get half of it, but not the rest of it, and the app just seemed to lose its mind because the network was unreliable. And I found myself thinking, what would make it possible for more apps to work the way Pocket does and less the way that Facebook works? I also had the opportunity to work with some startups here, including Clue and Wunderlist and some others that each had their own version of this problem.

Essentially everyone needs this. Everyone needs syncing because they want either, one, the user to be able to access their stuff from different devices, or two, some kind of sharing. And I think Wunderlist was an interesting case because they built out this crack engineering team to develop really good real-time syncing for a very simple case. It’s just a to-do list, and the common case that people use it for, I think, was a couple that’s grocery shopping and they want to make sure they don’t overlap and pick the same things in the cart. It worked really well, but they built this huge, I think it was like a 15 person, engineering team that spent years of effort to make really good real-time sync, and it seemed strange to me that you need this big engineering team to do what seems like a simple thing that every app needs.

We went down this road of trying CouchDB and Firebase and a bunch of others, and all were pretty unsatisfying.

And then that further led in, you know, that kind of idea, the sync problem, lodged in my mind, and then when we got started at Ink & Switch, some of our early user studies there were on sync and how people thought about it. And one thing that stuck with me from those was we looked into just kind of syncing in note taking apps and talked to a whole bunch of people about this, and we didn’t have a product at the time, so it was just kind of a user research study, but we went and talked to a bunch of folks, most of whom were using Evernote, which was kind of the gold standard at the time. And almost everyone we talked to, when I asked what’s your number one most important feature from your notes app, they said sync. I said, OK, so that’s why you chose Evernote, and they said yeah, and I asked, how well does it work? And they said terribly, it fails all the time. You know, I write a note on my computer, I close the lid, I go to lunch. Half an hour later, I go to pull it up on my phone. It’s not there. I have no idea why. And so some combination of those experiences sort of lodged this thing in my mind of: the technology industry can just do so much better, and this is important and everyone needs it. What’s the missing piece? And I wasn’t really sure, but that led into, once I met up with folks in the research world who indeed had been working on this problem for a while, getting excited about the technologies they had to offer.

00:12:15 - Speaker 1: Yeah, and then I guess I was downstream of that, because I got introduced to this space by Peter van Hartenberg, who at the time was a principal at the Ink & Switch research lab, and is now the director of the lab.

And he showed me a demo of the Pixelpusher project, and we can link to the article on this, but essentially this is a pixel art editing tool that was peer-to-peer collaborative, and the app itself is very standard, but what was amazing to me was he had implemented this app and he had 2 devices or 2 windows on the same device, and they were doing real-time collaboration, but there was no server.

And I had come from this world where whenever you add a feature to an app, you’ve got to write the front end and then you’ve got to write the back end, and you’ve got to make sure they line up whenever anything changes, it’s a whole mess, and it was just magical to me that you could just type up this JavaScript app and have it collaborating with another client in real time.

So I went down that rabbit hole, and there were the obvious attractions of the austere locations and, you know, minimal network connectivity and things like that. And also at the time the research was very oriented around P2P, so there was this notion of the user having more control of their data and perhaps not even requiring a central server, but a couple of things became even more appealing to me as I researched it more. One was the potential of higher performance. And I ended up writing a whole article about software performance that we can link to. But one of the key insights was that it’s not physically possible to have acceptably fast software if you have to go anywhere beyond the local SSD. Certainly if you’re going to a data center in Virginia or whatever, you’re totally hosed. So it was very important to incorporate this performance capability into Muse.

00:13:49 - Speaker 2: Yeah, that article was eye opening for me in that you connected the research around human factors, things that looked at what level of latency you need for something to feel snappy and responsive, and then separately the speed of light, which is sort of the maximum possible speed that information can travel, and if you put those together and do very simple arithmetic on that, you can instantly see it’s not about having a faster network connection. You literally cannot make something that will feel fast in the way that we’re talking about if you have to make a network round trip.
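
To make that arithmetic concrete, here is a rough back-of-the-envelope sketch in Swift; the distances and thresholds are illustrative assumptions, not figures quoted in the episode.

// Rough back-of-the-envelope latency sketch. The numbers are illustrative
// assumptions (roughly Berlin to a US East Coast data center), not figures
// from the episode.
let oneWayDistanceKm = 6_500.0
let lightInFiberKmPerSec = 200_000.0   // roughly two-thirds of c
let minimumRoundTripMs = 2 * oneWayDistanceKm / lightInFiberKmPerSec * 1_000
print(minimumRoundTripMs)
// ≈ 65 ms of pure propagation delay, before any routing, TLS, or server work,
// against the ~100 ms budget commonly cited for an interaction to feel
// instantaneous. A local SSD read, by contrast, is measured in microseconds.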

00:14:21 - Speaker 1: Yeah, and the one other thing that was really interesting to me about this space was the developer experience.

I alluded to this earlier with the Pixel Pusher demo, but in the before times there were two ways to develop apps.

You had the local model where you were typically programming against the SQL database, and everything was right there and it sort of made perfect sense. You would query for what you need and you write when you have new information and so on.

And then there was the remote model where you would make REST calls, for example, out to some service, like submit this edit or add a new post or whatever.

But then these two worlds were colliding, where we always wanted to be adding sync and collaborative capabilities to our apps, and we would try to kind of jam one into the other, like you would try to patch some REST onto the database or you’d try to patch some database onto your REST, and it just wasn’t working, and I realized we needed to do a pretty fundamental rethink of this whole architecture, which is what we ended up doing in the research lab and then now with Muse.

The last thing I’ll mention about my journey here was that my background was in back end engineering and distributed systems engineering, and so I had encountered variants of the sync problem several times, for example at Heroku, Adam. We had this challenge where we had these routers that were directing HTTP requests to a back end that was constantly changing based on these dynos coming up and down, and the routers needed to maintain in-memory routing tables based on the control plane that was being adjusted by the API.

And so we had a similar problem of needing to propagate state consistently and in real time to the in-memory databases of all these router nodes, and sure enough that work kind of came full circle and we were applying some of the same lessons here with Muse. So it’s a problem I’ve had the opportunity, for better or worse, to make a few passes at in my career.

00:15:57 - Speaker 3: Yeah, I think it’s an extremely hard problem that comes up so often across so many projects: eventually you need data over here in Box A to look the exact same as data over here in Box B. And it’s one of those problems that’s just surprisingly hard to get right, and there just aren’t that many libraries and existing solutions for it that you can drop in and implement. For a lot of other problems you can just go out and find a library, there’s code out there, or you can license it or use open source, whatever, but for whatever reason, sync is one of those things that needs to be custom baked to the project, just about every time.

00:16:38 - Speaker 2: And that’s part of what blew my mind back 8 years ago when I was looking for a syncing layer for Clue and realizing that, yeah, I just had this feeling like surely everyone has this problem, everyone needs it, everyone needs the same thing. It’s really hard, you know, an individual company shouldn’t be distracting from their core competency of building their app to create the syncing layer, and yet to my surprise, there really wasn’t much, and that continues to basically be true today.

00:17:06 - Speaker 1: Yeah, and this gets into our collaboration with Martin Kleppmann on CRDTs.

So briefly, you can think of there being two pieces to this problem. One is conveying the physical data around, and the other is, OK, you have all this data that you synchronize, what do you do with it, because it’s all a bunch of conflicting edits and so on.

And that’s where the CRDT technology came in. I think one of the reasons why we haven’t seen widespread standard libraries for this stuff is that the syncing problem is hard. We’ll talk more about that. But another is that we haven’t had the computer science technology to make sense of all of these edits. Well, we sort of did. There was, like, operational transforms, but you literally needed to hire a team of PhD computer scientists to have any shot at doing stuff like that. And so Google Docs basically had it and maybe a few others, but normal humans couldn’t do anything with it. But the CRDT technology and Automerge, which we’ll talk more about, made it much more accessible and possible to make sense of all these conflicting edits and merge them into some useful application state. So that’s the kind of why now: why now is a good time, I think, to be pursuing this.

00:18:06 - Speaker 3: Yeah, and I think almost surprisingly to me, the solution we came up with at Muse, I think, is actually really generic, and I think we solved it in a really elegant way that’s even more foundational to the technology than solving just for Muse. I think the solution we have can certainly solve for Muse in the future and is future-proof in that regard, but is broad enough to be applicable to a whole number of different uses and applications, which I think is really exciting too.

00:18:37 - Speaker 2: Maybe it’s worth taking a moment to also mention why we think local-first and this style of sync is important for Muse specifically. I think certainly Mark and I have had a long time interest in it, and Wulf, you have an interest in it as well, so it’s just something where we’d like to see more software working in this way, where the user has a lot more sort of control and literal ownership over the data because it’s on their device, in addition to being mirrored in the cloud. Certainly the performance element is huge for me personally, and I think for all of us on the team. But I think for Muse, as we go to this multi-device world, on one hand, we think that every device has its own kind of unique mood. The iPad is this relaxed space for reading and annotating, whereas the Mac or a desktop computer is for focus, productivity, you know, the phone is for quick capture, the web is good for sharing. OK, so really you need your work to be seamlessly available across all of them.

But at the same time, you know, we want that sense of intimacy and certainly the performance and the feeling that it’s in your control and you own it, and it belongs to you.

I think that maybe matters less for some consumer products, or maybe it matters less for more kind of B2B, you know, enterprisey products, but for this tool, which is for thinking, which is very personal, which very much needs to be at your fingertips and friction free, I think the local-first approach would be a good fit for a lot of software, but I think Muse needs it even more than most. So that’s why I’m really excited to see how this works out in practice as people try it out, and we really don’t know yet, right? It may be that we’ve made this huge engineering investment and in the end customers just say, I’d be happy with the cloud, yeah, it’s fine, I have some spinners, I can’t access my work offline. I hope not. But that could happen. We could be falsifying the business hypothesis, but I really believe that for our specific type of customer, they’ll go to use this product with the syncing layer, you know, once we shake out all the bugs and so on, and say, you know, this feels really fundamentally different from the more cloud-based software that I’m used to, like Notion, and also fundamentally different from the non-syncing pure local apps that I might use.

00:20:51 - Speaker 3: Yeah, I really think that with as connected as this world is and is becoming, there’s always going to be places of low connectivity, there’s always going to be places of just dodgy internet, and having an application that you know just always works, no matter what’s going on, and figures itself out later once it has good internet, is just so freeing compared to those times when, you know, your device is switching Wi-Fi networks or the LTE is just not quite what it needs to be to make things happen.

I think it really does make a really huge difference, especially when you’re deep in thought, working on your content in Muse; the last thing you want is to be interrupted for even half of a second with a small spinner that says please connect to the internet. And so just being able to free the application and free the user from even worrying about the internet at all, even if it works 99% of the time, it’s that 1% of the time that breaks your train of thought that is just really frustrating. And I think that’s what’s exciting about being able to be purely offline: it fixes that really huge problem of that really small percentage of time that it happens.

00:22:10 - Speaker 2: Very well said. Now with that, I’ll do a little content warning. I think we’re about to get a lot more technical than we ever have on this podcast before, but I think this is a topic that deserves it. So I’d love to, and me especially as someone who’s not deep in the technology and just observing from the sidelines, I’d love to hear about what’s the high level architecture, what are all the pieces that fit together here that make this sync when you’re on the internet and let you keep working even when you’re not? What is it that makes all that work?

00:22:41 - Speaker 1: Yeah, I’ll give a really quick overview and then we can dive into some of the specific pieces.

So to start with the logical architecture, the basic model is a user has a bag of edits, so you might have 1000 edits or a million edits where each edit is something like I put this card here or I edit this picture, and over time the user is accumulating all these edits and the job of the sync system is to ensure that eventually all of the users' devices have the same bag of edits.

And it passes those edits around as opaque blobs, and there are different flavors of blobs that we’ll talk about.

Basically there’s a bunch of bits of binary data that all devices need to have the same, and then it’s the device’s responsibility to make sense of those edits in a consistent way.

So given the same bag, each device needs to come up with the same view of the Muse corpus of that individual user, what boards are alive and what cards are on them and so forth. And then briefly in terms of the physical architecture, there’s a small server that’s running on Heroku, data is stored in Postgres and S3, and it’s implemented in Go, and again the server is just shuffling binary blobs around, basically. And then there are different front ends, different clients that implement this synchronization protocol and present a Muse corpus model up to the application developers. So the most important of these is the Swift client. We also have a JavaScript client, and both of these are backed by SQLite databases locally.

00:24:09 - Speaker 3: Yeah, and I think what’s really interesting about this architecture is that we actually maintain the entire bag of edits.

Edits only get added into the bag, but they never really get removed. And so the current state of the application is whatever the most recent edit is.

So if I make a card bigger on my Mac, and then I go to my iPad and I make that same card smaller, and then we synchronize those two things, well, at the end of the day, either the card is going to be smaller on both devices, or the card is gonna be bigger on both devices, and we just pick the most recent one. And that strategy of just picking the most recent edit actually makes conflicts essentially invisible, or so small and so easy to fix that the user can just say, oh, I want that big, let me make it big again. It’s really easy to solve from the user’s side without showing one of those really annoying dialogs: hello, there’s been an edit, there’s a conflict here, would you like to choose copy A or copy B? Just being able to automatically resolve those is more than half of the magic, I think, of this architecture.

00:25:13 - Speaker 2: I also note this is a place where I think the Muse domain, if you want to call it that, of the cards-on-a-canvas model works pretty well with this sort of automated resolution, which is, if you moved a card in one direction on one device and you moved it somewhere else on the other device, it’s not really a huge deal which one it picks, as long as it all kind of flows pretty logically.

By comparison, text editing, so what you have in a Google Docs, or certainly I know the Automerge team and the Ink & Switch team have done a huge amount of work on this, is a much harder space where you can get into very illogical states if you merge your edits together strangely, but I think a card move, a card resize, add, remove, even some amount of reparenting within the boards, those things are just pretty natural to merge together, I think.

00:26:02 - Speaker 3: Yeah, I think so, and I think even with the new text block feature in Muse, we end up slicing what would be a really long form text document into much smaller sentences or paragraphs. And so then for text edits, even though we’re only picking kind of the most recent one to win, we’re picking that most recent at the granularity of the sentence or of the paragraph, and so conflicts between documents for us are larger than they would be for Automerge or for Google Docs, but are small enough that it’s still ignorable by the user and easily solvable by the user.

00:26:42 - Speaker 2: Which incidentally I think is a trick we sort of borrowed from Figma, at least on the tech side, which is that in Figma and also in Muse, if one person edits a card to be red and someone else edits it to be blue, you don’t get a red-blue card, you just get one or the other, and it turns out for this specific domain, that’s just fine.

00:27:03 - Speaker 3: Yeah, I think we kind of lucked out having such a visual model, and we don’t need to worry about intricacies of multi-user live document editing.

00:27:13 - Speaker 1: Yeah, I would point to both Figma and Actual Budget as two very important inspirations for our work. I would say those are two of the products that were most at the forefront of this space, and thought about it most similarly to how we did.

And notably they, as well as us, sort of independently arrived at this notion of basically having a bunch of last-write-wins registers as the quote-unquote CRDTs.

So these are very, very small, simple, almost degenerate CRDTs, where the CRDT itself is just representing one attribute, for example the X coordinate of a given card. But this is an important insight of the industrial application of this technology, if you will: that’s a good trade-off to make. It basically covers all the practical cases, but it’s still very simple to implement, relatively speaking.
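
To make the idea of these tiny registers concrete, here is a minimal sketch of a last-write-wins register in Swift. The names and shapes are illustrative assumptions, not Muse’s actual implementation.

// Minimal last-write-wins register: one tiny "CRDT" per attribute, for
// example the X coordinate of a given card. Illustrative only, not Muse's code.
struct LWWRegister<Value> {
    var value: Value
    var timestamp: UInt64   // in practice a hybrid logical clock, discussed below

    // Merging two replicas keeps the later write; a real system also needs a
    // deterministic tiebreaker (e.g. a device ID) for equal timestamps.
    mutating func merge(_ other: LWWRegister<Value>) {
        if other.timestamp > timestamp {
            value = other.value
            timestamp = other.timestamp
        }
    }
}

var cardWidth = LWWRegister(value: 300.0, timestamp: 10)  // resized on the Mac
let newerEdit = LWWRegister(value: 180.0, timestamp: 12)  // later resize on the iPad
cardWidth.merge(newerEdit)                                // cardWidth.value is now 180.0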

00:28:03 - Speaker 2: I’ll also mention briefly Actual Budget, a great app basically made by one person and recently open sourced, so you can actually go and read the custom CRDT work there and maybe learn a thing or two that you might want to borrow from.

00:28:17 - Speaker 3: I think one of the really interesting problems for me about the CRDT was deciding which edit is the most recent, because it just makes logical sense to say, oh well, it’s 3 o’clock, and when I make this edit at 3 o’clock and I make a different edit at 3:02, obviously the one at 3:02 wins.

But since computer clocks aren’t necessarily trustworthy, sometimes I have games on my iPad that reset every day, and so I’ll set my clock forward or set my clock backward. Or if I’m on an airplane and there’s time zones, and there’s all kinds of reasons the clock might jump forward or jump backward or be set differently, and so using a fancier clock that incorporates a wall clock, but also includes a counter and some other kind of bits of information, lets us still order edits one after the other, even if one of those clocks on the wall is a year ahead of schedule compared to the other clocks that are being synchronized. I don’t know how in depth we want to get on that, but it’s called a hybrid logical clock.
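
For readers curious what that looks like, here is a generic sketch of a hybrid logical clock in Swift; this is the textbook-style algorithm with our own names, not Muse’s actual code.

// Generic sketch of a hybrid logical clock (illustrative, not Muse's code).
// A timestamp pairs the wall clock with a counter, so edits stay ordered even
// when a device's wall clock is skewed or jumps around.
struct HybridLogicalClock {
    var wallMillis: UInt64
    var counter: UInt32

    // True if this timestamp is ordered before the other one.
    func isBefore(_ other: HybridLogicalClock) -> Bool {
        (wallMillis, counter) < (other.wallMillis, other.counter)
    }

    // Advance the clock for a local edit.
    mutating func tick(nowMillis: UInt64) {
        if nowMillis > wallMillis {
            wallMillis = nowMillis
            counter = 0
        } else {
            counter += 1   // wall clock stalled or went backwards; keep ordering
        }
    }

    // Advance the clock when an edit arrives from another device.
    mutating func receive(_ remote: HybridLogicalClock, nowMillis: UInt64) {
        let newWall = max(wallMillis, remote.wallMillis, nowMillis)
        switch (newWall == wallMillis, newWall == remote.wallMillis) {
        case (true, true):   counter = max(counter, remote.counter) + 1
        case (true, false):  counter += 1
        case (false, true):  counter = remote.counter + 1
        case (false, false): counter = 0
        }
        wallMillis = newWall
    }
}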

00:29:23 - Speaker 1: Yeah, I think this is another great example along with choosing very simple CRDT structures of industrial style architecture where you could go for a full blown vector clock, and that gives you perfect logical ordering and a bunch of other nice properties, but it’s quite large and it’s expensive to compute and so on. Whereas if you choose a simpler fixed size clock, that can give you all the benefits that you need in practice, it can be easier to implement, it could be faster to run, and so on.

00:29:52 - Speaker 3: Like everything in life, it’s all about trade-offs, and you can get accuracy, but it costs more, or you can get a little bit less accuracy, and it costs a lot less, and for us that was the better trade-off: to have a fixed size clock that gives us enough of the ordering to make sense, but might not be exactly perfect ordering.

00:30:13 - Speaker 1: And we’ve been alluding to trade-offs and different options, so maybe it’s time to address it head on in terms of the other options that we considered and why they weren’t necessarily as good of a fit for us. So I would include in this list both iCloud and, what would you call it, like file storage.

00:30:27 - Speaker 2: It might be CloudKit or something, but yeah, they have one that’s more of a blob, kind of, you know, save files, what people will think of with their sort of iCloud Drive, almost kind of a Dropbox thing, and then they also have CloudKit, which I feel like is a key value store, but in theory, those two things together would give you the things you need for an application like ours.

00:30:47 - Speaker 1: Yeah, so there’s iCloud as an option, Firebase, Automerge, CouchDB maybe, and then there’s roll your own, which we ended up doing.

00:30:57 - Speaker 2: Yeah, the general wisdom is, you know, you don’t write your own if there’s a good off the shelf solution. You named some there that are commercial, some are built into the operating system we’re using, some are indeed research projects that we’ve been a part of. What ultimately caused us to follow our own path on that?

00:31:15 - Speaker 1: Yeah, so there was a set of issues that tended to come up with all of these, more or less acutely in different cases, but I think it’d be useful to go through the challenges that we ran into and talk about how they emerged in different ones of these other solutions.

So one simple one, it would seem, is just correctness, slash, it works. And the simple truth is, a lot of the syncing systems out there just do not work reliably. I hate to pick on Apple and iCloud, but honestly, they were the toughest in this respect, where sometimes you would, you know, submit data to be synchronized and it just wouldn’t show up, and especially with opaque closed source solutions and third party solutions, stuff would not show up and you couldn’t do anything about it; you couldn’t see what went wrong or when it might show up or if there was some error.

And then bizarrely, sometimes the stuff would pop up like 5 or 10 minutes later. It’s like, oh, it actually sort of worked, but it’s off by, you know, several zeros in terms of performance. So that was a really basic one: the syncing system has to be absolutely rock solid, and it kind of goes back to the discussion Wulf had around being offline sometimes. If there’s any chance that the sync system is not reliable, then that becomes a loop in the user’s mind. Am I gonna lose this data? Is something not showing up because the sync system is broken? Our experience has been that if there’s any lack of reliability or lack of visibility into the synchronization layer, it really bubbles up into the user’s mind in a destructive way, so we want it to be absolutely rock solid. Another important thing for us was supporting the right programming model. So we’ve been working on Muse for several years now. We have a pretty good idea of what capabilities the system needed to have, and I think there were 4 key pillars. One is the obvious transactional data. It’s things like what are the cards and where are they on the board. This is data that you would traditionally put in a SQL database. Another thing that’s important to have is blob support, due to all of these binary assets in Muse, and we wanted those to be in the same system and not have to have another separate thing that’s out of band, and they need to be able to relate to each other correctly.

00:33:09 - Speaker 2: This is something where a 10 megabyte PDF or a 50 megabyte video just has very different data storage needs than the tiny little record that says this card is at this X and Y position and belongs to this board.

00:33:23 - Speaker 1: Right, very different, and in fact you’re gonna want to manage the networking differently.

Basically you want to prioritize the transactional data and then load later, or even lazily, the binary data, which is much larger.

Yeah, so there was transactional data, blob data, and then real-time data slash ephemeral data.

So this is things like you’re in the middle of an ink stroke or you’re in the middle of moving a card around and this is very important to convey if you’re gonna have real time and especially multi-user collaboration, but again, you can’t treat this the same as certainly blob data, but even transactional data, because if you store every position a card ever was under your finger for all time, you’re gonna blow up the database.

So you need those 3 different types of data, and they all need to be integrated very closely.

So for example, when you’re moving a card around, that’s real time, but basically the last frame becomes a bit of transactional data, and those two systems need to be so lined up with each other that it’s as simple as changing a flag. If you’re going out on a second or a third band for real-time data and need to totally change course for saving the transactional data, it’s not gonna be good.

That was quite rare. I don’t know if we found any systems that support all three of these coherently.

00:34:33 - Speaker 2: The ephemeral data element I found especially interesting, because you do really want that real-timey feeling of someone wiggles a card with their finger and you can see the wiggling on the other side. That just makes the thing feel live and just responsive in a way that it doesn’t otherwise.

But yeah, at the same time, you also don’t want hundreds of thousands of records of the card moved 3 pixels right, and then 3 pixels left.

And one thing I thought was fascinating, correct me if I misunderstood this, but is that because the client even knows how many other devices are actively connected to the session, it can choose to not even send that ephemeral data at all. It doesn’t even need to tap the network. If no one else is listening, why bother sending ephemeral data? All you need is the transactions over time.
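
As a toy illustration of that split between ephemeral and transactional data, here is a small Swift sketch; the types and names are hypothetical stand-ins, not Muse’s actual API.

// Toy sketch of handling a card drag: many ephemeral position updates while
// the finger moves, one durable transactional edit when it lifts. All names
// here are hypothetical, not Muse's actual API.
struct CardPosition { var x: Double; var y: Double }

struct CardDragSession {
    var connectedPeerCount: Int                          // known from the sync session
    var broadcastEphemeral: (CardPosition) -> Void       // never persisted
    var commitTransactionalEdit: (CardPosition) -> Void  // becomes a durable, synced edit

    // Fired many times per second while the finger moves.
    func fingerMoved(to position: CardPosition) {
        // If no other device is listening, skip the network entirely.
        guard connectedPeerCount > 0 else { return }
        broadcastEphemeral(position)
    }

    // Fired once when the finger lifts: only the final frame is stored.
    func fingerLifted(at position: CardPosition) {
        commitTransactionalEdit(position)
    }
}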

00:35:21 - Speaker 1: Right, this is actually a good example of how there’s a lot of cases where different parts of the system need to know or at least benefit from knowing about other parts.

So it becomes costly, or maybe just an outright bad idea, to separate them, especially as we’re still figuring out as an industry how they should work. I think there’s actually quite a bit of benefit to them being integrated.

Another one that we could talk about eventually is prioritizing which data you download and upload. You might choose to first download blobs that are closer to you in board space, like it’s in your current room or it’s in adjacent rooms, and then later you can download other blobs. That’s something you couldn’t do if the application had no notion of the networking layer.

It actually brings us to another big challenge we saw with existing systems, which is multiplexing. So I’ll use an example of Automerge here, and this is something we’ve seen with a lot of research oriented CRDT work. It’s very focused on a single document, so you have a document that represents, you know, say a board or whatever, and a lot of the work is around how do you synchronize that document, how do you maintain correctness, even how do you maintain performance when you’re synchronizing that document across devices.

Well, the challenge with Muse, with our model, is you might have, you know, easily 1,000, but potentially tens of thousands up to millions of documents in the system, corresponding to all your individual cards and so on. And so if you do anything that’s order N in the number of documents, it’s already game over.

Here’s a specific challenge that I had in mind for the system. You have a corpus, let’s say it’s a million edits across 10,000 documents or something like that, and it’s 100 megabytes. I wanted the time to synchronize a new device, that is, to download and persist that entire corpus, to be roughly proportional to the time it would take to just physically download that data. So if you’re on a 10 megabyte per second connection and it’s 100 megabytes, maybe that’s 10 seconds. But the only way to do that is to do a massive amount of multiplexing, coalescing, batching, and compression, so that you’re taking all these edits and squeezing them into a small number of network messages and compressing them and so on. So you’re sort of pivoting the data so it’s better suited to the network transfer and the persistence layer. And again, you need to be considering all these things at once: how does the application model relate to the logical model, relate to the networking protocol, relate to the compression strategy. And we weren’t able to find systems that correctly handled that, especially when you’re talking about thousands or millions of documents being synchronized in parallel.

And the last thing I’ll mention is what I call industrial design trade-offs. We’ve been alluding to it in the podcast so far, but things like simplicity, understandability, control, these are incredibly important when you’re developing an industrial application, and you tend not to get these with early stage open source projects and third party solutions and third party services. You just don’t have a lot of control, and it was too likely to my mind that we would just be left out in the cold at some point, where the system didn’t work or it didn’t have some capability that we wanted, and then you’re up a dead end road, and so what do you do? Whereas this is a very small, simple system. You could print out the entirety of the whole system and it would probably be a few pages; it’s a few thousand lines of code, it’s not a lot of code, and it’s across a couple of code bases, and so we can load the whole thing into our head and therefore understand it and make changes as needed to advance the business.

00:38:38 - Speaker 3: Yeah, I think that last point might honestly be the most important, at least for me. I think having a very simple mental model of what is happening in sync makes things so much easier to reason about. It makes fixing bugs so much easier. It makes preventing bugs so much easier. We’ve been talking about how sync is hard and how almost nobody gets it right, and that’s because it’s complicated. There’s a bajillion little bitty edge cases of if this happens, but then this happens after this happens, and then this happens. What do we do? And so making something really really simple conceptually, I think was really important for the muse sync stability and performance at the end of the day.

00:39:21 - Speaker 2: I’m an old school web developer, so when I think of clients and servers, I think of REST APIs, where you maybe make kind of a versioned API spec, and then the back end developer writes the endpoint to be called and the front end developer figures out how to call that with the right parameters and what to do with the response. What’s the diff between a world that looks like that and how the new sync service is implemented?

00:39:50 - Speaker 1: Yeah, well, a couple things. At the network layer, it’s not wildly different. We do use protocol buffers and binary encoding, which by the way, I think would actually be the better thing for a lot of services to do, and I think services are increasingly moving in that direction, but that core model of you have, we call them endpoints. You construct messages that you send to the endpoint and the server responds with a response message. That basic model is pretty similar, even if it’s implemented in a way that’s designed to be more efficient, maintainable, and so on than a traditional rest server.

But a big difference between a traditional REST application and the Muse sync layer is that there are two completely separate layers, what we call the network layer and the app layer. So the network layer is responsible for shuffling these binary blobs around: the transactional data, the ephemeral data, and the big binary assets. And the server knows absolutely nothing about what’s inside of them, by design, both because we don’t want to have to reimplement all of the Muse logic about boards and cards or whatever in the server, and also because we anticipate eventually end-to-end encrypting this, and at that point, of course, the server can’t know anything about it, it’s not gonna be possible. So that’s the networking layer, and then if you sort of unwrap that you get the application layer, and that’s the layer that knows about boards and cards and edits and so on. And so it is different, and I would say it’s a challenge to think about these two different layers. There are actually some additional pivots that go on in between them, versus the traditional model where you would, like, POST to /v1/boards directly and you’d put in the parameters of the board, and then the server would write that to the boards table in the database. There are a few different layers that happen with this system.

00:41:30 - Speaker 2: So if we want to add a new card type, for example, or add a new piece of data to an existing card, that’s purely in the application layer; the back end doesn’t know anything about that, and no changes are needed on the back end?

00:41:44 - Speaker 1: Yeah, no changes are needed.

In fact, one of the things I’m most proud about with this project is we basically haven’t changed the server since last year, December, and we’ve been, you know, rigorously iterating on the app, you know, adding features, changing features, improving a bunch of stuff, and the server is basically the same thing that was up 4 months ago, just chugging along. And that’s a huge benefit, I think, of this model of separating out the application model and the network model, because the network model is eventually gonna move very slowly. You basically figure that out once and it can run forever. And the application model has more churn, but then when you need to make those changes, you only need to make them in the client or the clients; maybe you update the application schema so that current and future clients can understand it, and then you just start including those data in the bag of edits.

00:42:26 - Speaker 3: Yeah, I think one thing that’s really nice is that those protocol buffers that you were talking about are type safe and kind of statically defined, so that way, when we’re sending that message over the wire, we know 100% we’re sending exactly the correct messages no matter what, and that guarantee is made at compile time, which I think is really nice because it means that a lot of bugs that could otherwise easily sneak in if we were using kind of a generic JSON framework, we’re gonna find out about when we hit the please build Muse button, instead of the I’m running Muse and I randomly hit a bug button. And that kind of confidence early on in the build process has been really important for us as well, to find and fix issues before they even arise.

00:43:11 - Speaker 1: Yeah, to my mind this is the correct way to build network clients. You have a schema and it generates type-safe code in whatever language you want to use.

There are just enormous benefits to that approach. I think we’re seeing it with this on Muse, and again, I think more systems, even more traditional B2B type systems, are moving in this direction.

By the way, everyone always made fun of Amazon’s API back in the day. They had this crazy XML thing where there are a zillion endpoints. I actually think they were closer to the truth than the traditional, you know, nice REST CRUD stuff, because their clients are all auto-generated, and sure enough they have literally a zillion endpoints, but everything gets generated for free into a bunch of different languages.

Anyways, one challenge that we do have with this approach is, you know, one does not simply write a schema when you have these multiple layers. So again, if you look at a traditional application, you have a protocol buffer definition of, say, a board, a Board message in protobufs, and that would have fields like title and width and height or whatever. And when you want to update the board, you would populate a memory object, encode it to a protocol buffer, and send it off to the server. Well, it’s not quite that simple for us, because we have this model of the small registers that we call atoms.

So an atom is the entity, say a board; the attribute, say the title; the value, say “Muse podcast”; and the timestamp. And your bag of edits is comprised of all these different atoms. But the problem is, how do you encode both how you’re gonna send an atom, which is as those tuples, as well as what a logical board is, you know, what the collection of atoms is meant to look like, you know, that it’s gonna have a title and a width and height and so on. So that’s been honestly a pretty big challenge for us, where it doesn’t fit into any of the standard schema definition approaches, certainly not the regular protocol buffer schema, which again we use for the network and for encoding the messages that are wrapped up in the network, but you need a separate layer that encodes the application model, as we call it, you know, what is a board, what is a card, what attributes do they have and so on.
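
A bare-bones sketch of what one of these atoms could look like in Swift; the shape here is our own illustration of the entity/attribute/value/timestamp idea, not the actual Muse schema.

import Foundation

// Bare-bones sketch of the entity / attribute / value / timestamp model.
// Illustrative only, not the actual Muse schema.
struct Atom {
    enum Value {
        case string(String)
        case number(Double)
        case reference(UUID)   // e.g. a card pointing at its parent board
    }

    let entity: UUID              // e.g. a particular board or card
    let attribute: String         // e.g. "title" or "width"
    let value: Value
    let timestampMillis: UInt64   // in practice a hybrid logical clock, as above
}

// A user's corpus is just a growing bag of these, for example:
let edit = Atom(entity: UUID(),
                attribute: "title",
                value: .string("Muse podcast"),
                timestampMillis: 1_652_000_000_000)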

00:45:06 - Speaker 2: And Wulf, if I recall you have a blog post about these atomic attributes. I’ll link that in the show notes for folks.

00:45:14 - Speaker 3: Yeah, so unfortunately no relation between my name and the atom. It’s A-T-O-M.

00:45:18 - Speaker 2: Yes, we have two Adams on this podcast. The ADAM is different from the ATOM.

00:45:25 - Speaker 1: Yeah. A big inspiration on this, by the way, is Datomic. I don’t know if we’ve mentioned that yet on this podcast, but Datomic is a database system developed by Rich Hickey and team; he is also the creator of Clojure. And it uses this model in contrast to the traditional relational model where you have tables and columns and rows.

The Datomic model is more like a bag of time-stamped attributes, where you have an entity, an attribute, a value, and a timestamp. It can be more challenging to work with that model, but it’s infinitely flexible. You can sort of put whatever model you want on top of that, and it works well for creating a generic database system.

You know, you couldn’t have a generic Postgres, for example, that could run any application. You need to first create tables that correspond to the actual application you’re trying to build, whereas with an atom oriented database, you basically have one huge table, which is atoms. So it’s useful again for having this slower moving, more stable synchronization layer that handles moving data around, and then you build the application that moves quickly on top of that.

00:46:27 - Speaker 3: Yeah, and like we talked about earlier, it’s so much simpler to reason about. All of the problems of my iPad is on version 1, my Mac is on version 2, and my iPad Mini is on version 3. They’re sending data back and forth. At the end of the day, every single database on all three of those clients is gonna look the same, even though they have completely different logic, maybe different features. But all the simplicity of that data store makes it much, much easier to reason about as the application gets upgraded or as two different versions of the client are synchronizing back and forth.

00:47:03 - Speaker 2: How does that work in practice? So I can certainly imagine something where all of the data is sent to every client, but a V1 client just doesn’t know what to do with this new field, so just quietly stores it and doesn’t worry about it. But in practice, what happens if I do have pretty divergent versions between several different clients?

00:47:23 - Speaker 1: Recall some podcasts ago, we suggested that everything you emit should have a UUID and a version. Well, sure enough, that’s advice that we take to heart with this design, where all the entities, all the messages, everything has a UUID, and also everything is versioned, so there are several layers of versioning: the network protocol is versioned and the application schema is versioned. So by being sure to thread those versions around everywhere, the application can then make decisions about what it’s gonna do, and Wulf can speak to what the application actually chooses to do here.

00:47:54 - Speaker 3: Yeah, exactly. If we’re sending maybe a new version of a piece of data on the network layer that device A just doesn’t physically know how to decode from the network, then it’ll just save it off to the side until it eventually upgrades, and then it’ll actually read it once it knows what that version is.

00:48:11 - Speaker 2: So is it sort of like, can I make a crude metaphor here: someone emails me a version of a Word doc from a later version that I don’t have yet, I can save that on my hard drive, and later on when I get the new version, I’ll be able to open the file.

00:48:25 - Speaker 3: Yeah, exactly right. It’s very similar to that. And then I think there’s a different kind of upgrade where we’re actually talking the same language, but I don’t know what one of the words is that you said.

So let’s say we add a completely new content type to muse called the coffee cup, and everyone can put coffee cups on their boards, right? That coffee cup is gonna have a new type ID attribute that kind of labels it as such.

New clients are gonna know that type 75 means coffee cup, and old clients are gonna look at type 75 and say, oh, I don’t know about type 75, so I’ll just ignore it.

And so the data itself is transferred over the network schema, and the client kind of understands the app schema and those versions, but it might not understand the physical data that arrives in the value of that atom.

And in that case, it can happily ignore it and will eventually understand what it means once the client upgrades.

And so there are a number of different kind of safety layers to how we version things. If we’re unable to even understand kind of the language that’s being spoken, it’ll be saved off to the side. If we do understand the language that’s spoken, but we don’t understand the word, we can just kind of safely ignore it, and then once we are upgraded, we can safely understand both the language and the word.
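
Here is a toy Swift sketch of those two safety layers; the enum and names are hypothetical stand-ins, not Muse’s actual code.

import Foundation

// Toy sketch of the two safety layers described above (hypothetical names).
enum IncomingEdit {
    // Layer 1: we can't even decode this schema version yet; park the raw
    // bytes and re-read them after the app upgrades.
    case unknownSchemaVersion(rawBytes: Data)
    // Layer 2: decoded fine, but the content type ID may be unknown to this
    // client (e.g. 75 = coffee cup on an old client).
    case decoded(contentTypeID: Int, payload: Data)
}

struct EditStore {
    var parkedForLater: [Data] = []     // saved off to the side
    var applied: [Data] = []            // edits this client understands
    let knownContentTypes: Set<Int>     // e.g. [1, 2, 3] on an old client

    mutating func handle(_ edit: IncomingEdit) {
        switch edit {
        case .unknownSchemaVersion(let rawBytes):
            parkedForLater.append(rawBytes)
        case .decoded(let typeID, let payload):
            guard knownContentTypes.contains(typeID) else { return }  // ignore type 75
            applied.append(payload)
        }
    }
}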

00:49:47 - Speaker 1: Yeah, so maybe to recap our discussion of the low level synchronization protocol before we go on to the developer experience and user experience, it might be useful to walk through a sort of example.

So suppose you are doing a thinking session in your nice comfy chair on your iPad, you’re offline, and you’re making a few dozen edits to a series of different boards and cards in your corpus. Those are going to write new atoms in your database, and those are essentially gonna be flagged as not yet synchronized, and then when you go online, those atoms, assuming it’s some plausible number, you know, maybe less than 1,000 or so, are all gonna be combined into a single network message.

So this is that multiplexing efficiency, where you certainly don’t need to check every document in your corpus, and you don’t even need to do one network write per edit or even one network write per document. You can just bundle up all of your recent changes into a single protocol buffer message and potentially compress it all with gzip, and then you send that out to the server.

The server doesn’t know anything about these edits, you know it’s just one big binary packet. The server persists that, and then it sends a response message back to the client and says, OK, I’ve successfully received this. You can now treat this as synchronized and the server will take responsibility for broadcasting it out to all the clients.

And then clients, as they come online, if they’re not already online, will immediately receive this new packet of data, called a pack. And then they can decompress and unpack that into its constituent atoms, and once they’ve processed those, tell the server, I have successfully saved this on my device. And in the background the server is essentially maintaining a high watermark of, for each device that’s registered for this user, what’s the latest pack or block it successfully persisted, and that way, as devices come on and offline, the server knows what new data it needs to send to each individual device. And that works both for conveying these updates in near real time as they happen, as well as for doing big bulk downloads if a device has been offline for a long time.
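
A rough Swift sketch of the client side of that walkthrough; the names and the framing and compression details are illustrative assumptions, not the real protocol.

import Foundation

// Rough sketch of bundling unsynchronized edits into one "pack".
// Names and framing are illustrative, not the real protocol.
struct PendingEdit {
    let encodedAtom: Data
    var synchronized: Bool
}

// Every not-yet-synchronized edit from an offline session becomes part of a
// single network message, rather than one request per edit or per document.
func buildPack(from edits: [PendingEdit]) -> Data? {
    let unsynced = edits.filter { !$0.synchronized }.map(\.encodedAtom)
    guard !unsynced.isEmpty else { return nil }   // nothing to send
    var pack = Data()
    for atom in unsynced {
        pack.append(atom)                         // in reality: protobuf-framed
    }
    return compress(pack)                         // in reality: gzip or similar
}

func compress(_ data: Data) -> Data { data }      // placeholder for brevity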

And I know we’ve mentioned it a few times, but to my mind this multiplexing and batching and compression is so important; it’s the only thing that makes this even remotely feasible with the Muse data model of having a huge number of objects. And then I think this leads pretty naturally to a discussion of the developer experience. So we’ve talked about this sort of sync framework, and that essentially is gonna present a developer interface up to the application developer. So Wulf, maybe you can speak a little bit to that.

00:52:19 - Speaker 3: Yeah, we’ve talked some about the simplicity that we’re aiming for, just conceptually and how synchronization works.

I think it’s equally important for this to be extremely simple for the end developer to use as we’re building new features in use or as we’re, you know, changing the user interface around.

That developer, whether it’s me or Julia or anybody else working on Muse, doesn’t need to be able to think in terms of sync at all. We just need to be able to write the application is the ideal world, needs to be very, very simple.

And so keeping that developer experience simple was a really big piece of designing what sync looks like inside of Swift for iOS.

Since we had been built on core data beforehand, a lot of that developer interaction ends up looking extremely similar to core data. And so we build our models in Swift, it’s a Swift class. We have all of the different attributes where there’s position, and size, and related document, and things like that, and we just stick what’s called in Swift a property wrapper, it’s just a small little attribute. In front of that property that says, oh by the way, this thing, this thing belongs in the sync database. This property is gonna be merged, and that one little piece of code, that one little kind of word in the code program is what fires up the sync database and the sync engine behind it to make all of this stuff work. And that has been really important both for conceptually building new features, but also for migrating from core data to sync. Because the code that core data looks like beforehand, and the code that sync looks like now, is actually extremely similar.

Early on in the development process, our very first kind of internal beta, internal alpha, pre-alpha, whatever version you want to call it. Very early on in the process, we actually ran both core data and the sync engine side by side. So some of the data in Muse would load from core data and other bits and pieces would load from sync, but both of those, because they looked very, very similar from the developer’s perspective, from kind of how we use both of those frameworks. It allowed us to actually slowly migrate use over from one to the other, by replacing bits of core data with bits of sync, and then this little bit of core data with this little bit of sync. I mean there’s, you know, thousands and 10s of thousands of lines of custom logic to make muse muse. And so it was really important to keep all of that logic running, and to keep all that logic separate from the physical data that logic was manipulating. And so making those appear similar to the developer, let us do that. It let us keep all of that logic essentially unchanged in use while we kind of swap out that foundation from underneath it.

00:55:15 - Speaker 2: And I remember when you implemented the first pass at this persistence library, and I forget if Yuli was maybe away on holiday or maybe she was just working on another project, but then she came in to use your kind of first working version and start working on the sort of porting it across and she had a very positive reaction on the developer experience, you know, you are sort of developing the persistence layers so you naturally like it because.

Here, maybe the way you like it and you’re thinking about the internals, she’s coming at it more from the perspective as a consumer of it or a client or a user, and the developer experience, and I found that to be promising because I mean, she, like most, I think iOS developers has spent many, many years in her career using core data. Which is a long developed and well thought through and very production ready persistence layer that has a lot of edge cases covered and is well documented and all that sort of thing.

So in a way it’s a pretty high bar to come in and replace something like that and have someone just have a positive reaction to using it as a developer.

00:56:21 - Speaker 3: Yeah, I was so happy when she said that she kind of enjoyed using it and kind of understood how it worked, because of course every developer likes their own code, but when a developer can use and is comfortable with another developer’s code, that’s really, really important. And that was absolutely one of my goals is to make sure that it was simple for Julia to use and simple for any other developer that comes onto the Muse team that doesn’t have background in Muse and in our code base. For them to be able to jump in quickly and easily and understand what’s going on, was a really important piece of how this framework was built.

00:56:57 - Speaker 1: Yeah, I think this is a really important accomplishment, and Wulf is maybe even underselling himself a little bit, so I want to comment on a few things. One is, while there’s just this simple annotation of I think it’s at merged is that it Wulf. Yeah, that’s right.

Sometimes when you see that, that annotation instructs the framework to do something additionally on the side, on top of the existing standard relational database, you know, like basically do your best, try to synchronize this data out of band with some third party service or whatever.

But this in fact totally changes how the data is persisted and managed in the system, so it’s sort of like a whole new persistent stack for the application. And I think that’s important because we constantly see that the only way you get good results on sync system, especially when you’re talking about offline versus online and partially online, it has to be the one system that you use all the time. You can’t have some second path, that’s like the offline cache or offline mode that never works. It needs to be the one, you know, true data synchronization and persistence layer. So I think that’s as important though. There’s another subtle piece here, which is the constraints that you have with the industrial setup. So a lot of the research on CRDTs and synchronization basically assumes that you have one or a small number of documents in memory, and that if you’re going to be querying these documents or managing synchronization of these documents that you have access to the full data set in memory. But it’s not practical for our use case, both because of the total size of memory and the number of documents we’d be talking about. So a lot of critical operations can take place directly against the database or the data layer can smartly manage bringing stuff in out of memory. It’s not like we have the whole new corpus up in memory at any one time, the system has to smartly manage what gets pulled up from disk in the memory and what gets flushed back, and then that. Introduce a whole bunch of like, basically cash coherency and consistency issues. So that’s a very tricky problem to manage, some of which are just hard engineering problems, but some of which would not even be possible, again, if we didn’t have this approach of owning the whole stack, we can control everything from top to bottom. So, for example, you might want to be able to query, select all boards where the title is Fu, and to be able to do that directly against the database with a single SQL query. Well, there’s no hope of doing that with a system where the data is stored opaquely to you. You need to be able to understand exactly how the data is persisted all the way down on disks so you can bring it up to memory efficiently. That’s a good example of the type of thing that we’re able to do with the system.

00:59:25 - Speaker 3: Yeah, exactly right.

And as new changes come over the network, if the object is already loaded into memory, then that object can immediately see those changes from the network, update its properties, and be immediately live in the user interface. And if that object is not loaded into memory, then those changes from the network just get saved directly to the database. And so the data that the object uses to hydrate its properties and to actually use its logic internally is the exact same structure as what’s stored in the database and it’s the exact same structure that gets sent to and from the network. And that consistency across all three of those places is extremely important, as Mark was saying, because it means that you’re not. Having to maintain 3 different representations of the same data, you’re only maintaining exactly 1 representation of the data, and that consistency is extremely important as hundreds or thousands of changes are going up to the network, down from the network, loaded from the database, and you’re moving all these things around in memory, keeping everything consistent in that one data format has been extremely important.

01:00:34 - Speaker 1: One other challenges I’ll mention here, I’m actually not even sure how you solve this. Well, so maybe you can enlighten me here.

Most of these systems again in the research setting, they use a functional reactive rendering approach, and briefly, this is where you give the renderer a complete snapshot of the world, and it renders the world, and then For an update, you just give it a complete new snapshot and it smartly looks at the discs between the old snapshot and the new snapshot and efficiently affects the appropriate updates in the UI.

This is important because that’s very amenable to the synchronization model where you have updates coming in from all over the place. You have updates from your local device, you have updates from the network, updates from disk. And it’s very convenient to be able to just merge those in to a single new view of the world and tell the UI to go figure it out. You know, I’m not really sure who edited what here, but something changed, please re-rendered the screen so it looks right. That’s a very convenient property that you have, for example, in a lot of JavaScript based systems that use React. But that’s not how the traditional UI stack works in Swift, as far as I understand it, it’s a more imperative model if you say, put this thing here, change the color to this, make the dimensions this, and that can be very efficient and it’s certainly straightforward when you’re getting started, but it’s not clear to me how we actually affect the right changes when you have these updates coming in from all over the place. So maybe you can speak a little bit to that, Wulf.

01:01:49 - Speaker 3: Yeah, and some of that is actually borrowed from core data where in core data you load up a context, you make a bunch of changes to the model, and then at the very end of that you say, OK, save everything, go, and Core model kind of held everything in memory for a little bit, but not until that last save command does it actually physically write to the database and kind of ensure that it’s permanent.

And something similar happens for us where if there’s 100 changes that come down over the network, we apply all 100 of those changes, but then at the very end of that network call, there’s an, oh by the way, I just saved a bunch of stuff, call that goes out, and that notification goes out to the rest of the code base that says, hey everybody, I just changed these object IDs and these attributes inside of these scopes.

And so, we haven’t quite talked about scopes and object IDs, but scopes are basically a bag of objects and objects are a bag of attributes.

You can think of it that way.

So when that notification goes out, then the rest of the UI already knows, hey, I’m displaying board A with cards B, C and D, and so when that notification goes out that says, hey, by the way, everybody, I just updated whatever object B is, good luck, then our interface can say, oh, I have a card name to B. Great, I’m gonna go update that and make sure it’s OK. And so the physical object and memory actually does get updated immediately.

When network requests come in, but the interface updates only after a save command or only after some very specific notifications that show up. And that has been very important because if there are 100 updates coming down from the network, I don’t want to update the user interface 100 times. I want to update the interface once after all those 100 updates have been processed, and that makes things a lot quicker for us to be able to say. When do I update, why do I update, and which object is it that caused this update? All of that information lets us efficiently update the interface as changes come in or as changes are saved.

01:03:55 - Speaker 1: So it’s coalescing and batching all the way down, you’re saying.

01:03:58 - Speaker 3: Exactly, yeah, it’s a giant pile of coalescing and caching all the way down as turtles all the way down for sure.

I think one last piece of the developer experience that has been really important is we talked about this a bit with the protocol buffers and how that uses a lot of code generation. For all of the models in Swift are generated from that single protobuff schema, but then once that schema is built, we actually do a second round of code generation using a program called Sorcery, and we can link that down below as well, but that actually reads the SWIFT code that is generated by Protobuff and lets us use templates to create even more SWIFT code from that original SWIFT code and so, One important thing is type safety, which we’ve talked about with Protobuff, and what type safety gives us is compile time errors, if there’s anything wrong. And so a big piece of what we could generate is that query syntax to say, hey, give me all the boards with title Fu, making sure that that returns boards and not cards and not URLs or tweets or anything else. All of that code for searching our database is. Code generated off of the code that was generated from Protobuff. And so that’s been another really, really helpful piece is there’s probably hundreds and hundreds of lines of code that are guaranteed type safe and generated for us to use, which prevents us from making type unsafe bugs in the code.

01:05:40 - Speaker 2: So the beta has been online for a few months, we’ve been using it internally longer than that.

I think I’ve been using it kind of as my primary use place for about 5 months, and we’ve had something like 400 beta testers, which is A big enough number to give it a real solid run even though it’ll be a small fraction of the folks that will be using it after launch.

Now it would be great to hear both the user experience side of what we’ve learned from using a system like this in practice, again, not just in the research context, as well as the what has it been like debugging problems on the client side and kind of running a production server at this scale. Yeah, I’d love to get into the lessons learned with you guys a little bit.

01:06:30 - Speaker 1: I would say that especially from the server and protocol side, it’s been going very well so far.

We spent a lot of time going back to several years of research in the lab to try to understand how to correctly architect and design a system like this, and I think we were able to bring a lot of those lessons to bear.

I think it’s basically working well, like the model that we have with atoms and objects and scopes and a separate network and outplay that’s all working great. And like I said, I basically haven’t touched the server. And almost half a year now, it’s just chunking along, even as we add all these new users, so that’s great. And well, how do you feel it’s been going on the client side?

01:07:04 - Speaker 3: I think the interesting thing for kind of bug finding and bug fixing is how the network relates to what’s going on on the client. There’s been some either whether it’s performance or whether it’s just logic bugs or user interface bugs, where it relies on.

Your iPad is currently in this state and it receives this network message at the same time that the user is doing this action, and it manifests in this way.

And so then setting up a reproducible case. To make sure that I get that same network message in can sometimes make it a very slow iteration to debug, because I need to constantly set up the network to do the right thing, then set up the iPad to have the right initial state, watch everything happen to reproduce that one bug. But once I’ve found kind of what those messages are, then it becomes a lot easier because I can set up a unit test locally, where I can actually spin up multiple clients locally inside of the unit test and say, OK, device A creates this, device B creates this, I synchronize, I see the bug, now I can fix it. And so some of the I’m not sure if this is a lesson so much as a a war story, but it’s just the difficulty of making sure that the initial state and the network state are easily reproducible to find the bug. And then once you’ve found it, it’s actually really straightforward to, you know, fix it, but just finding the cause of the bug sometimes in this kind of distributed system can be a bit tricky because what I see on my iPad with the same data that you see on your iPad. Might function very differently depending on what we have going on in the network, or how big our corpus is, or what else is going on in the background.

01:08:57 - Speaker 1: Yeah, it actually makes me think that an important lesson learned is that the client is actually much harder than the server, and the reason is the client is much more multidimensional.

There’s the data side and the application logic side, and the rendering side, and there’s a bunch of third party code in there and there’s, you know, batteries and networks. It’s basically a wild environment, whereas the server is just very regimented. There’s like a dozen messages that come in and out. It’s all. In one single queue, it has a single database that it’s talked to, and it’s complex and that the whatever, 2000 lines of code are very subtle and you got to get them right.

But once you figure things out, it’s pretty nailed down, whereas, like I said, the client side is just wild. And so that’s an important lesson that you just got to expect that when you’re shifting a lot of your model to the client, you got to invest more there.

I’m also saying on debugging and debugability investing there always pays off, so we mentioned IDs inversions. I’m a huge fan of logging, which is log everything everywhere, and our standard model whenever we encountered a bug, is whoever saw the bug identifies the request ID that corresponds to when that happened, and then we just produce the logs and both the client and the server that are tagged with that ID and then you can go from there. That’s been very important.

Another thing is assertions. I’m a huge fan of assertions. I can’t tell you how many times I’ve added assertions to this code base. I’m like, there’s no way we ever trip this up, you know, how could we ever possibly send a blank value here? And then sure enough, two weeks later I get this alert on Century, you know, someone submitted a blank string for this attribute. What? How did that possibly happen? Oh, it’s well, you know, A and B and C and yeah sure enough that’s what happened. So it’s really important to add these ratchets, type checks, the assertions, the validations, it all pays off.

01:10:35 - Speaker 3: Yeah, I’m glad you mentioned logging. I think having those unified IDs for those network messages back and forth and having really verbose logging on both the client and the server, when something pops up and we have a problem, we can look through those logs and actually trace down exactly kind of what went wrong and reproduce that state.

And that’s another honest benefit of this kind of bag of atoms state is that we’re only ever adding items to this bag, we’re never changing items that are already in the bag, and we’re never removing items from the bag.

And so we can also just look through the bag and see, OK, I know that I got change A, B, and C, did they end up in the bag or not? I have A and B, but where did C go? And so being able to have those identifiers and have such a simple conception of what the sync database physically looks like, really lets us get down to the cause of almost anything we’ve run into we’ve been able to solve with logs and identifiers, those two things have been able to point us in the right direction.

01:11:41 - Speaker 1: I’ll also come back to multiplexing one last time. It was such an important investment for us to make for the system to work, and I actually realized there’s one step of multiplexing that I forgot to mention.

So we talked about how the clients, they all gather up their edits into one or a few packs. Those all get sent to the server, by the way, those packs can be bunched up into a single message, so that’s another layer of packaging and compression, but then the server.

The server has this challenge of there’s basically a zillion clients in the world and one or a few servers and databases and needs to manage all of the rights.

So if the sync server did a post graph database right for every time someone did an ink stroke or moved a card, there’s no way that would work. The round trip times alone would be prohibitive. And even if the server did a database right for every time a pack of changes came in from an individual device, it would still probably be prohibitive because eventually you’re gonna have a lot of devices online. So the server actually does is that again, it does a coalescing and a batching and every, I think it’s 5. 100 milliseconds, it takes all of the edits that have arrived in that time frame and writes them into the database in a huge bulk transaction, and then fans out the responses to all the devices. That’s another layer where if we didn’t have that, I don’t know if it would fall over immediately, but certainly as we get more users, it would become unviable. So I think this idea of multiplexing back and forth is really important, and it’s honestly been a huge amount of work, but now that we have it, I think it’s gonna be really key.

01:13:09 - Speaker 2: And I’ll weigh in on the lessons learned from the user experience side. First, there’s certainly the response we’ve gotten from a lot of beta testers, which is, I would actually just call it surprise at the speed and responsiveness of the syncing.

But for me and my heavy use of it, I’m a heavy Muse user always have been, but I think especially leading up to this launch, there was a lot of strategy work to do and a lot of ideation work to do in terms of all the storytelling materials and so on.

So I found myself using it especially heavily and having the Mac app also increases the utility and so on, but it’s almost confusing that on one hand, you get this thing that is a native app and everything happens instantly, and there’s never a spinner, and it never says now loading, and there’s never the stutter that you just come to expect from web SAS apps. But at the same time, it has this incredible liveness. So I would say that maybe coming back to comparing some of these other types of systems like iCloud is a good example where iCloud is quite good in terms of never blocking the application while you wait for something, but then it might be. 30 seconds till it propagates your other device and you’re sort of refreshing or staring at the screen or waiting for it to show up and uses between my Mac and iPad has that liveliness that I’ve come to associate with the Google Docs or a Figma, but then it doesn’t have that flakiness that I’ve come to associate with those things, and indeed I have often used more typically my iPad when I’m in an offline environment traveling or something like that. And yeah, it’s just so use the word freeing early in the discussion here today, Wulf, which is freeing you of the worry that you’re gonna hit that hiccup, your network’s gonna hit that hiccup, and suddenly you can’t access what you’re doing or you’re gonna get stuck. But at the same time, I can feel complete confidence or feel completely comfortable that things are gonna get synced and I’m not gonna get, for example, a weird merge conflict as I do sometimes with Dropbox, much as I love Dropbox and A major tool in my tool kit for a lot of years. I have also had things sort of disappear or seem to get lost because there’s a merged conflict that sticks the conflict off to the side, and I don’t notice that it’s there until two weeks later, something like that. So Muse somehow manages to really get the best of both of those worlds with the local first sink. And in a way that almost feels like getting away with something, or you’re cheating somehow, it’s no way, you can’t have both of these things together, you have to pick one. So that to me is amazing, a little magical really.

01:15:47 - Speaker 1: And relatedly, I’m also very happy with our investment and giving the user visibility into what’s happening with sync.

We learned in our research at the lab on sync that there’s the actual performance of sync, then there’s the like consistency and reliability of it, and then there’s the layer of being honest with the user about what’s happening, and it’s often not the sync is slow or it’s not working that gets you, it’s lying to the user about it or not giving them any way to understand what’s going on.

And so we make as an absolute first class citizen in this protocol, the ability to understand exactly how many bytes you have left to sync of both the transactional blob types.

So anytime you can always see how much do I have left to sink. Is it growing or shrinking, that is, is stuff getting added into my bag faster than I’m able to sink it down. And you get a definitive message when you have 0 left. Like you can say, OK, I’ve in fact synchronized every bite that’s known to my universe of Muse Corpus. And we reflect that in the eye, and I think that visibility into the rock style and sinking performance is really key.

01:16:51 - Speaker 2: Yeah, so if you’re getting a chance to try out Muse 2.0, uh, I definitely recommend just looking in the lower right corner of either your window or your screen where you’ll see a little circle, and it can be filled in or not filled in, depending on your network status, and it will kind of pulse a little bit, just very gently when it’s.

Working, uploading or downloading, inspired by the old school hard drive indicator lights, and then you can tap on it or click on it to open and get detailed information about the amount remaining to upload and download. It’s very simply done, but it helps give you that visibility precisely, as you said, and love to get back some time to talk about the design work that went into that because we think that part of things is really important.

Well, with the launch around the corner, I’m really excited to see how this scales in practice. I’m sure we’ll have a few server meltdowns and difficult bugs and so on, but it is very much the proving out this technology in a way to help take it from the lab and take it from the world of theory and research.

And show that it is something that can be done in the real world and is valuable and beneficial to customers that they find that their software works better and they are able to work better and in the case of M’s domain that they are able to think better because they have this fluid connection between all their devices.

So what are some of the things we’ll be building on top of this technology foundation in the future?

01:18:21 - Speaker 3: I would love to see an eventual web version of Muse and taking the JavaScript sync client that we’ve written and actually flushing out what a browser version of Muse might look like. I think that could be really interesting. And then of course there’s options for Android or for any of the other tablets that are out there, I think would be fun. I think being able to share any board with a link over the web and just have a full new experience load up, I think would be really fascinating use of sync and collaboration.

01:18:53 - Speaker 1: Yeah, I think a web client could be very interesting.

There’s also the iPhone, maybe that’s obvious, but we currently have this very minimal iPhone app and we’ve long said that the phone is one of the three pillars, is one of the three key device types.

It has its own unique use case.

So I think to realize that vision will need to have a full blown in its own way use client for the iPhone.

And then looking further afield, the system was designed for day one to support not only multiple devices from the same user, but also multiple users collaborating on documents and that actually required quite a bit of design affordances.

Sort of inserted into the structure here so that we’d come back in a year or two and like, you know, turn this key and be able to enable collaboration.

So looking forward to potentially exploring that and then looking even further beyond that, this could be a very important enabler for end user programming.

You think about like how do you program data that you don’t even have access to? Do you like just HTP get it all and then change it and then HT posts it back and that doesn’t really work in the world of traditional rest APIs, whereas if you have the full data set and it’s just sort of this live multiple thing where anytime you turn it or Touch it, it’s magically reflected through the ether on all the other devices.

I think that could be a really powerful substrate for doing end user programming eventually.

And then conveniently at this time, you’ll have this nice whole corpus of data that’s very important and personal to you that you would be motivated to manipulate with programs. So that could be a fun thing to explore.

01:20:19 - Speaker 2: All very exciting. So yeah, the multi-device path is really just the beginning and also will help us prove out that this technology can work and that it can scale and that our team can handle it. And speaking of team, we do have jobs related to this, so this sort of work sounds interesting to you and you have skill with CRDTs or anything we’ve discussed here, you should check out our jobs page, perhaps a local first job with the new team is in your future.

01:20:50 - Speaker 3: Absolutely, and if you’re excited and don’t have experience with CRDTs, that’s probably fine too. 2 years ago, I had no idea what a CRDT was, and yet here we are.

01:21:00 - Speaker 2: Now you’re on the forefront of the field.

01:21:02 - Speaker 3: Yeah, now we’re doing it.

01:21:03 - Speaker 2: Also at that point, I think you’ll be giving a talk at the, as I understand it, is the world’s first local first conference, which is happening here in Berlin in June as part of a larger academic event. You wanna plug that quickly, Wulf?

01:21:17 - Speaker 3: Yeah, I’m really excited about it. It’ll be on June 7th, will be the piece that I’m doing. There’s gonna be lots of talks about where Local First is going and how it’s being used in a whole number of different places. And so I’ll be giving a short talk on Local first in Muse and doing a little bit more detail about how our CRDT works and what some of that back end infrastructure looks like in production and making it live and so I’m really excited about it. I think it’ll be a lot of fun.

01:21:47 - Speaker 2: Let’s wrap it there. Thanks everyone for listening. If you have feedback, we’re on Twitter at @museapphq and we’re on email, hello at museapp.com. It’s always nice if you leave a review for us in Apple Podcasts. Mark Wulf, it’s been really a pleasure to watch as you two have poured your heart and soul into building the system over the last year, and I think it will be incredibly rewarding to see it out in the world, and I’m just really looking forward to hearing the response from everyone once they get a chance to give it a try.

01:22:20 - Speaker 3: Yeah, thanks for having me on. It’s been a lot of fun.

01:22:23 - Speaker 1: That was great. Thanks everyone.

Discuss this episode in the Muse community

Metamuse is a podcast about tools for thought, product design & how to have good ideas.

Hosted by Mark McGranaghan and Adam Wiggins
Apple Podcasts Overcast Pocket Casts Spotify
https://museapp.com/podcast.rss