February 2024 – Rubberduck News

About a year ago, I came to the conclusion that it would not be realistic to refactor Rubberduck into a client/server architecture and to incrementally make the 2.x internal model work with the Language Server Protocol (LSP), which had been established as a target for 3.0.

To get the ball rolling I started a new project from scratch, and began writing a prototype to explore the AvalonEdit API for the client/editor part. The server side took longer to take off: I simply dived head-first into the LSP specification and hit a blocker with socket transport (which was the most efficient way to do this). It would pop an elevation prompt, and requiring admin rights to connect to the language server process was a complete showstopper. That’s about when I turned my attention to the OmniSharp library… which implemented JsonRPC and LSP a million times better than I ever would have, and made my entire prototype model moot. So I happily scrapped it all and started over, and here we are.

Transport works over named pipes, with bidirectional JsonRPC (Remote Procedure Call) messages. JsonRPC is essentially a specification for formatting/structuring messages between processes (hence client/server), and LSP builds on top of it; OmniSharp deals with all this boilerplate for us.
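To make the message format concrete, here is a minimal Python sketch of how LSP frames a JSON-RPC 2.0 message: a `Content-Length` header, a blank line, then the JSON body. This is not Rubberduck code (RD3 is C# and leaves all of this to OmniSharp); the `frame` and `unframe` helpers are names made up for illustration, and the transport itself (named pipes) is left out.

```python
import json


def frame(payload: dict) -> bytes:
    """Encode one JSON-RPC message with the LSP base-protocol header."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def unframe(data: bytes) -> dict:
    """Decode a single framed message (illustrative helper, no streaming)."""
    header, _, rest = data.partition(b"\r\n\r\n")
    length = None
    for line in header.split(b"\r\n"):
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("missing Content-Length header")
    return json.loads(rest[:length].decode("utf-8"))


# "initialize" is the first request a client sends in LSP; params are trimmed down here.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {"processId": None, "rootUri": None, "capabilities": {}},
}
wire = frame(request)
```

Round-tripping `wire` through `unframe` gives back the original request, which is all a transport needs to guarantee before the protocol layer takes over.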

A year later there’s a VBE add-in that’s just a menu with a few intriguing commands, like “Show Editor” and “New Workspace”… and nothing else. Rubberduck 3.0 will only have minimal interactions with the VBIDE, which is the polar opposite of what Rubberduck has been doing so far. As a result, the RD3 memory footprint in the host process is absolutely minimal, and the main/only thread of the host process is consequently very much left alone, unless the add-in is creating or synchronizing a workspace. This removes an entire, rather populous group of things that can potentially insta-kill the host process, making it the most stable Rubberduck add-in we could ever come up with. You can’t blow up the host process if you’re not crashing in the host process! That said, knock on wood still… the VBE is life, it… finds a way.

The “Show Editor” command brings up the Rubberduck Editor, starting it if it isn’t already running. It’s a standalone Windows Presentation Foundation (WPF) application that runs in its own separate process, so if something terrible happens and the editor crashes, the add-in can just spawn a new one and carry on as if nothing happened. The editor app is a JsonRPC server here, and the add-in/host process is its client.

When the editor opens or creates a workspace/project (the add-in sends a URI as a command-line argument), it starts a language server process (another separate process!) and sends it the text content of all opened documents.

When the server process completes initialization, it reads the text content of all workspace documents that aren’t opened in the editor, and starts the parser pipeline.
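As a rough illustration of that hand-off, the sketch below merges the two sources of document text: whatever the editor sent for opened documents wins, and everything else is read from the workspace. The function name, the URIs, and the fake "disk" are all assumptions made for the example, not the actual server code.

```python
from typing import Callable, Dict, Iterable


def load_workspace_documents(
    workspace_uris: Iterable[str],
    opened: Dict[str, str],
    read_text: Callable[[str], str],
) -> Dict[str, str]:
    """Return the text of every workspace document, preferring editor content."""
    documents = dict(opened)  # content sent by the editor for opened documents
    for uri in workspace_uris:
        if uri not in documents:  # not opened in the editor: read it ourselves
            documents[uri] = read_text(uri)
    return documents


# Placeholder workspace: a dictionary stands in for the file system.
fake_disk = {
    "file:///demo/Module1.bas": "' saved content",
    "file:///demo/Class1.cls": "' saved class",
}
opened_in_editor = {"file:///demo/Module1.bas": "' unsaved edits"}

documents = load_workspace_documents(
    fake_disk.keys(), opened_in_editor, fake_disk.__getitem__
)
```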

In RD2 it’s a `RubberduckParserState` object that holds all the (mutable) state; in RD3 the (immutable) state consists of hierarchical symbols, some of which can resolve to a type – and they’re held in an ExecutionContext that maintains a symbol table and can copy itself into an ExecutionScope, …and you can already tell it’s a whole different beast.

Rubberduck 2.x was already doing something quite similar while pre-processing the precompiler directives, and the unreachable case inspection makes use of it… While impressive, it’s tacked on top of the parser rather than being the parser.

It all goes back to a fateful evening of 2014 when I somehow got stuck on the idea that all Rubberduck needed was declarations, and then we’d resolve identifier references and have something to work with. Over the years that followed we made good use of, and pushed the limits of, this naive model that mangled the concepts of symbols and types into a catch-all declaration that could just as well be a class module or a line label.

So now we’re going to be resolving the type of symbols, and instead of lookup dictionaries giving us the declarations of a particular module, we get hierarchical symbols that simply know what their child symbols are, and a symbol table that contains everything that’s in scope inside the module member we’re looking at.
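A minimal sketch of what “symbols that know their children” can look like, written in Python for brevity. The `Symbol` class, its `kind` strings, and the sample module are all hypothetical; the real RD3 symbol types are C# and considerably richer.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Symbol:
    name: str
    kind: str  # e.g. "module", "procedure", "parameter", "variable"
    children: List["Symbol"] = field(default_factory=list)

    def descendants(self) -> Iterator["Symbol"]:
        """Depth-first walk: no lookup dictionary needed to find child symbols."""
        for child in self.children:
            yield child
            yield from child.descendants()


module = Symbol("Module1", "module", [
    Symbol("Counter", "variable"),
    Symbol("DoSomething", "procedure", [
        Symbol("value", "parameter"),
        Symbol("result", "variable"),
    ]),
])

names = [symbol.name for symbol in module.descendants()]
```

Here `names` lists the module’s symbols in declaration order, members first and then whatever each member contains.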

In other words, we’re going to be a few things short of an actual interpreter (not linking external types, for one), but more than close enough to be issuing diagnostics rather than inspection results.

Parser Pipelines

I recall when we retrofitted cancellation capabilities into the RD2 parser and made it asynchronous. In 3.0 the thing is asynchronous by nature, as it builds on top of the .NET Task Parallel Library, but this time everything is happening inside Dataflow, a more abstract library that goes further than tasks and wraps them into “blocks” that connect to each other to move state through. We start by giving it a `WorkspaceUri`; the pipeline then gets (or creates) the current document state for each document in that workspace, and depending on parallelization settings it can dedicate a thread to each document.
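The shape of that flow can be approximated in a few lines of Python with `asyncio`. To be clear, this is an analogy and not TPL Dataflow: the stage functions, the in-memory `WORKSPACE`, and the `max_parallelism` setting are all stand-ins invented for the sketch.

```python
import asyncio
from typing import Dict, List

# Placeholder workspace: URI and document contents are made up.
WORKSPACE = {
    "file:///demo": {
        "Module1.bas": "Sub Foo()\nEnd Sub",
        "Class1.cls": "Public Value As Long",
    }
}


async def acquire_document_states(workspace_uri: str) -> Dict[str, str]:
    """First stage: get (or create) the state of each workspace document."""
    return dict(WORKSPACE[workspace_uri])


async def parse_document(name: str, text: str) -> dict:
    """Second stage: stand-in for producing a syntax tree."""
    await asyncio.sleep(0)  # yield control, as real parsing work would
    return {"document": name, "lines": text.splitlines()}


async def run_pipeline(workspace_uri: str, max_parallelism: int = 2) -> List[dict]:
    documents = await acquire_document_states(workspace_uri)
    gate = asyncio.Semaphore(max_parallelism)  # the parallelization setting

    async def bounded(name: str, text: str) -> dict:
        async with gate:
            return await parse_document(name, text)

    # One task per document; results come back in input order.
    return await asyncio.gather(*(bounded(n, t) for n, t in documents.items()))


trees = asyncio.run(run_pipeline("file:///demo"))
```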

So far that’s implemented differently and more robustly, but conceptually the same thing happens in RD2. What happens once we have a syntax tree for each document is different though.

In RD2 we would be collecting all declarations and storing them in the DeclarationFinder, a service that the resolver uses to get resolution candidates – and that many other features use whenever they need to find a declaration.

In RD3 we’ll be collecting symbols in two distinct traversals of each module’s syntax tree. The first pass collects the member symbols, which includes everything in the declarations section of the module and each procedure scope, including its parameters; the second pass collects all the remaining symbols inside each procedure.

The declaration finder is not making it to v3: instead, each workspace is given an execution context, where a hierarchical symbol table is maintained. During the first symbol pass, this context gets all the module and member symbols, and once all modules have been traversed we resolve a VBType for each typed symbol, so the second pass has all the information it needs to resolve a VBType for all the remaining symbols.
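Why types are only resolved after every module has been traversed is easiest to see with a forward reference. In this simplified Python sketch, `Module1` declares a member of type `Customer` before the `Customer` module has been seen; collecting first and resolving afterwards makes the order irrelevant. The data layout, the function names, and the rule that any module name counts as a class type are assumptions for the example only.

```python
from typing import Dict, List, Tuple

# Hypothetical input: module name -> (member name, declared type name) pairs.
MODULES: Dict[str, List[Tuple[str, str]]] = {
    "Module1": [("CurrentCustomer", "Customer"), ("Count", "Long")],
    "Customer": [("Name", "String")],
}
INTRINSIC_TYPES = {"Long", "String"}


def collect_members(modules: Dict[str, List[Tuple[str, str]]]) -> Dict[str, dict]:
    """First pass: register every module and member, no types resolved yet."""
    table: Dict[str, dict] = {}
    for module, members in modules.items():
        table[module] = {"kind": "module", "declared": None, "type": None}
        for member, declared in members:
            table[f"{module}.{member}"] = {
                "kind": "member", "declared": declared, "type": None,
            }
    return table


def resolve_types(table: Dict[str, dict]) -> None:
    """Runs once all modules are collected, so forward references resolve."""
    for symbol in table.values():
        declared = symbol["declared"]
        if declared is None:
            continue
        if declared in INTRINSIC_TYPES:
            symbol["type"] = declared
        elif table.get(declared, {}).get("kind") == "module":
            symbol["type"] = f"class {declared}"  # simplification: any module is a type
        # anything else stays unresolved (None)


table = collect_members(MODULES)
resolve_types(table)
```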

As the syntax tree traversal enters a procedure scope, the context generates an execution scope, which is essentially a stack frame that has its own scoped symbol table, which only includes symbols that are accessible to the procedure we’re in. When the traversal exits the procedure, the resolved symbols are copied from the execution scope before the scope is dismissed; when the module has been completely traversed, all the resolved symbols get copied to the workspace’s execution context.
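Here is a deliberately small Python sketch of that enter/exit dance. It borrows the `ExecutionContext` and `ExecutionScope` names from the post, but the members and behavior shown are guesses for illustration: the real types are C#, work with immutable state, and copy symbols per module before they reach the workspace context, a step this sketch collapses into one.

```python
from typing import Dict, Optional


class ExecutionScope:
    """Stack-frame-like scope with its own symbol table (illustrative only)."""

    def __init__(self, visible: Dict[str, Optional[str]]):
        self.symbols = dict(visible)  # a copy: the context is left untouched
        self.resolved: Dict[str, str] = {}

    def resolve(self, name: str, vb_type: str) -> None:
        self.symbols[name] = vb_type
        self.resolved[name] = vb_type


class ExecutionContext:
    """Workspace-level symbol table (flattened to a dictionary here)."""

    def __init__(self) -> None:
        self.symbols: Dict[str, Optional[str]] = {}

    def enter_scope(self) -> ExecutionScope:
        return ExecutionScope(self.symbols)

    def exit_scope(self, procedure: str, scope: ExecutionScope) -> None:
        # Copy what the scope resolved before the scope is dismissed.
        for name, vb_type in scope.resolved.items():
            self.symbols[f"{procedure}.{name}"] = vb_type


context = ExecutionContext()
context.symbols["Module1.Count"] = "Long"

scope = context.enter_scope()
scope.resolve("i", "Long")
visible_before_exit = "Module1.Foo.i" in context.symbols
context.exit_scope("Module1.Foo", scope)
```

Until `exit_scope` runs, nothing resolved inside the procedure leaks into the context, which is the property that makes the scope behave like a stack frame.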

Once all workspace symbols are resolved, a semantic pass will be able to traverse each executable scope to issue various diagnostics.

A visual representation of how the pipeline dataflow blocks are currently interconnected.

The part that collects symbols from referenced libraries still needs to be ported from RD2, but I’m not worried about that part at all.

Current Status

As of this writing, most of the pipeline itself is done; what needs attention now is the exhaustive list of all possible semantic tokens, which is how the editor/client is going to be able to implement semantic syntax highlighting – the second symbol resolution pass needs to tokenize the syntax tree, and then the resulting tokens need to be sent to the client. But first, I need to categorize every single one of them. LSP specifies a handful of common kinds of tokens, but the default kinds are insufficient to correctly tokenize VBA code. Moreover, because semantic tokens responses are typically rather large, for performance reasons what’s sent to the client isn’t the tokens themselves, but integers that represent them: a legend has to be crafted to map these integer IDs to semantic tokens on the client side.
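For reference, this is how the LSP integer encoding works: each token becomes five integers (line delta, start-character delta, length, index into the legend’s token types, and a bit set of modifiers), all relative to the previous token. The Python sketch below implements that encoding; the tiny legend and the sample tokens are placeholders, not the legend RD3 will end up with.

```python
from typing import List, Tuple

# Placeholder legend; the server announces its legend to the client once.
LEGEND_TYPES = ["keyword", "function", "variable", "type"]
LEGEND_MODIFIERS = ["declaration", "readonly"]

# (line, start character, length, token type, modifiers), zero-based positions
Token = Tuple[int, int, int, str, List[str]]


def encode(tokens: List[Token]) -> List[int]:
    """Flatten tokens into the relative five-integers-per-token LSP format."""
    data: List[int] = []
    prev_line, prev_start = 0, 0
    for line, start, length, token_type, modifiers in sorted(
        tokens, key=lambda t: (t[0], t[1])
    ):
        delta_line = line - prev_line
        # Start is relative to the previous token only when on the same line.
        delta_start = start - prev_start if delta_line == 0 else start
        modifier_bits = 0
        for modifier in modifiers:
            modifier_bits |= 1 << LEGEND_MODIFIERS.index(modifier)
        data += [delta_line, delta_start, length,
                 LEGEND_TYPES.index(token_type), modifier_bits]
        prev_line, prev_start = line, start
    return data


# A few tokens from "Sub Foo()" (line 0) and "    Dim x As Long" (line 1).
tokens: List[Token] = [
    (0, 0, 3, "keyword", []),
    (0, 4, 3, "function", ["declaration"]),
    (1, 4, 3, "keyword", []),
    (1, 8, 1, "variable", ["declaration"]),
    (1, 13, 4, "type", []),
]
data = encode(tokens)
```

The client reverses the process with the same legend, which is why the legend has to cover every token kind the server can ever emit.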

Once that’s done, I’ll be wiring it all up and making the server start processing a workspace whenever one gets opened, and start processing a code file whenever it gets modified in the editor. Then the client can handle the server notifications about workspace symbols and semantic tokens being refreshed (and possibly even diagnostics).

Errors and Diagnostics

Now that’s another massive paradigm shift. Work hasn’t started on these yet, but we can already tell that RD3 inspections aren’t going to “run” per se. Rather, diagnostics are to be issued by the semantic pass, when the executable symbols are being interpreted within an execution scope.

Inspections that RD2 dubs parse tree inspections would become diagnostics issued during the initial traversal of the parse trees, along with syntax errors – i.e. error nodes in the parse tree. The hardest part of this is going to be coming up with error messages that make sense: if you’ve experienced parser errors in Rubberduck before, you know these error messages require a very deep understanding of the parser rules to make any kind of sense, and turning these into human-friendly errors is a very difficult task that will likely not be completed by the time 3.0 is released. That said, last year’s prototype has confirmed that the editor will be able to render them as squiggly lines that show a tooltip when hovered; similar to modern editors, hovering diagnostics will pop a ducky button that lists all available code actions aka quick-fixes (and refactorings) for that diagnostic, if any.
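On the wire, a syntax error would travel as an LSP diagnostic: a range, a severity, and a message. The Python sketch below shows that shape with an optional friendlier message swapped in when one exists. The lookup table, the raw parser message, the `code` value, and the `source` name are all invented for the example; only the field names and the severity value (1 means error) come from the LSP specification.

```python
# Hypothetical mapping from raw parser messages to friendlier wording.
FRIENDLY_MESSAGES = {
    "mismatched input 'End' expecting NEWLINE":
        "Unexpected 'End'; is the previous statement incomplete?",
}

SEVERITY_ERROR = 1  # LSP DiagnosticSeverity.Error


def to_diagnostic(line: int, start: int, end: int, parser_message: str) -> dict:
    """Build an LSP-shaped diagnostic for an error node on a single line."""
    return {
        "range": {
            "start": {"line": line, "character": start},
            "end": {"line": line, "character": end},
        },
        "severity": SEVERITY_ERROR,
        "source": "Rubberduck",   # assumed source name
        "code": "SyntaxError",    # made-up diagnostic code
        # Fall back to the raw parser message when no friendly text exists yet.
        "message": FRIENDLY_MESSAGES.get(parser_message, parser_message),
    }


known = to_diagnostic(4, 0, 3, "mismatched input 'End' expecting NEWLINE")
unknown = to_diagnostic(9, 2, 5, "no viable alternative at input 'Foo'")
```

The fallback matters: until every parser message has a human-friendly counterpart, the raw text is still better than no diagnostic at all.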

Phase II

With the language server resolving symbols, the project is entering a new phase: in the next couple of weeks/months, RD3 will become much more than a glorified Notepad, and will start feeling more and more like an actual IDE. With access to the symbols, the editor can now implement all the features, so the rest is a matter of coming up with a decent backlog, and letting the relentless march towards the first Rubberduck 3.0 alpha release begin!

RD3: A Year Later
