Proposal: Improve whitespace handling #4668

macsux · 2024-11-13T17:14:50Z

What problem are you trying to solve?

Current LST modelling of whitespace/comments is very awkward, difficult to reason about, and requires use of "padding" classes. The problem arises in a number of areas:

Since currently LSTs do not have a representation for tokens (ex: language specific keywords, semicolons, commas, brackets, etc), when whitespace need to be captured around those parts of syntax, the tree node has to have "additional" Space element to put it. We currently do this by wrapping the whole tree node into JRightPadded or JLeftPadded element which serves no other purpose other then to model this extra whitespace. It is not actually part of a logical "semantic" tree and is a transient node most do not care about. It's presence is why all LSTs expose it's underlying element for all get/with methods, and only give access to it via a "buddy" padding class.
Due to "left-side" whitespace by default approach, certain whitespace / comments end up belonging to nodes they don't logically belong to. Example

var a = 1; // about a
var b = 2;

would result in //about b comment being part of white space belonging to var b = 2;, which is clearly not what it's logically related to.

Since it is desirable to make LSTs work in multiple languages, this approach causes issues when other languages make use of tokens not accounted for in original language, often preventing their direct reuse and require them to be wrapped in language specific LST just to be able to deal with extra tokens and the requirement around extra white space modelling.
In some cases optional language specific tokens that result in specific keywords and associated whitespace to be emitted. This modeling is "awkward", inconsistent and implicit in rather then explicit in its effects. An example of this is varargs on VariableDeclarations used to model the handling of whitespace around ... token in cases like this void printNumbers(int... numbers). While the type is Space, the rendering of token itself ... is implicit since it is tied if the Space is null. An unsuspecting user may set it to Space.Empty assuming it would not any change in rendering, but implicit logic tied to this argument would cause new token to be rendered, even though it's not explicitly present in LST. This problem often comes up when modelling support for other languages as well where optional tokens and keywords may appear.

Describe the solution you'd like

Make Tree have two Space properties: leadingSpace and trailingSpace. This would clean white space modelling that doesn't require "wrapping" nodes of JLeftPadded and JRightPadded, and remove the need for padding class. It would also be possible to correctly associate white space to it's logical node.
Provide explicit support for tokens while making them abstract enough to be applied cross language by adding TokenSet tokens to Tree. Each token would have a semantic meaning (aka Kind) rather then explicit character to express it's intended use in LST composition. For example something that is a semicolon in java may be a comma in C#, yet their intended use is the same - separating multiple repeating sibling LST elements in a list type manner. This would facilitate for lossless transfer of token based information without having to capture it in any "wrapping nodes", and "generic" recipes can work with token information relying on its semantic meaning in expression composition rather then language specific semantics.

Another area where this would work well is ability to model optional, language specific tokens without having to "pollute" other language modelling (like java's) with things that don't appear there. Consider example of a method declaration between java and c#:

Java:

public void MyMethod()

C#:

public async void MyMethod()

Here we have async keyword that may appear in c# version. Atm, it requires the whole J.MethodDeclaration to be wrapped in Cs.MethodDeclaration just to allow for this extra keyword capture. However, if Tree contained "grab bag of tokens" such as TokenSet tokens, it can be used to place a Token of type Async (and the related whitespace associated with it).

C# specific LSTs can be augmented with get/with methods to help manipulate this keyword as if it was a property for more user friendly access, yet use the underlying "token grabbag" as the storage mechanism. Ex

public Token? AsyncKeyword => this.Tokens.Find(TokenKind.Async);
public MethodDeclaration WithAsyncKeyword(AsyncKeyword? token) => this.Tokens.Set(token);

(TokenSet.Set method would remove it from the TokenSet if value is null, or add/replace existing value)

The text was updated successfully, but these errors were encountered:

macsux added the enhancement New feature or request label Nov 13, 2024

github-project-automation bot added this to OpenRewrite Nov 13, 2024

macsux changed the title ~~Proposal: Improve whitespace and non-syntax related metadata handling~~ Proposal: Improve whitespace handling Nov 13, 2024

timtebeek moved this to Backlog in OpenRewrite Nov 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: Improve whitespace handling #4668

Proposal: Improve whitespace handling #4668

macsux commented Nov 13, 2024 •

edited

Loading

Proposal: Improve whitespace handling #4668

Proposal: Improve whitespace handling #4668

Comments

macsux commented Nov 13, 2024 • edited Loading

What problem are you trying to solve?

Describe the solution you'd like

macsux commented Nov 13, 2024 •

edited

Loading