Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing Performance & Maintainability. #3369

Open
3 tasks
andrevidela opened this issue Aug 9, 2024 · 5 comments
Open
3 tasks

Parsing Performance & Maintainability. #3369

andrevidela opened this issue Aug 9, 2024 · 5 comments

Comments

@andrevidela
Copy link
Collaborator

andrevidela commented Aug 9, 2024

Long term users of the idris programming language will know that one of the weaknesses of the compiler is
it's most common interaction model: the parser. Because the parser is quite a complex piece of software, there
is not a single element that we can point out and fix to resolve all the symptoms we experience because of it.

The role of this issue is to keep track of problems related to the parser, and different efforts deployed to solve
them.

This issue will be closed when the current state of the parser will be improve in at least two aspects: Performance
and maintainability. First, a non-exhaustive list of symptoms we experience and would like to address.

Existing and known problems

  1. Performance is subpar
  2. Inconsistent error recovery messages
  3. Unexpected whitespace rules
  4. Code architecture regarding location is brittle
  5. Code architecture around desugaring and resugaring is brittle
  6. No tree sitter support
  7. Code synthesis generates unparsable code

Next steps

Here are a list of smaller steps that are not necessarily related to code that we can push as a PR on the
project but we still need to do in order to make progress toward better identifying and scoping solutions.

  • Measure performance in multiple scenarios: Large files, interactive modes, visually ambiguous code, etc.
  • investigate options around tree sitter. Are there ways to generate idris-compatible code from tree-sitter? Is there a way to generate tree-sitter spec from the parser? What kind of custom lexer do we need to write?
  • Investigate a better API for file location that is compatible with desugaring and resugaring, and does not require mindful use of FC in constructors but can be handled automatically

Conclusion

There are definitely more things we could do but I thought those would be small, achievable targets in the currently underspecified
goal to address parser runtime performance and maintainability. Please feel free to add to this discussion and share ideas of
small goals we can achieve easily to better design issues. Please, also share your experience with the parser if it is not represented
here in a way that we might be able to address with this kind of project

@andrevidela
Copy link
Collaborator Author

andrevidela commented Aug 10, 2024

Some personal notes:

  • Last year @stefan-hoeck showed us a very efficient constant-time parser using his library idris2-parser. My personal assessment was that it's essentially the target to hit in term of performance, but re-writing the entire parser with it might be too much work.
  • At the beginning of the year I had a chat with @wenkokke about Tree-Sitter, it is undeniable that tree-sitter offers large benefits in terms of ecosystem and accessibility. Because of this, I personally think it is worth investing in a tree-sitter compatible solution. I started working on a prototype to get familiarised with the API but made no progress on the biggest piece of software that we need for an idris2 version: the whitespace lexer.
  • Across the years, many contributors, including myself, thought about tackling the problem of partial-evaluation for idris. But nothing materialised yet. At this point the problem is so big and complex it's a project in and of itself, because of that, I don't think we should think of it as a "quick fix for parser performance" but as a genuine research project.
  • There might be a way out of this local minimum by developping a solution akin to happy/bison/BNFC where we have a static definition of the grammar and generate idris2-parser code from it, and a tree-sitter grammar for editor support. This solution takes us away from the realm of research and into pure engineering, which might be more achievable given the resources we are given.
  • The parser code itself leaves at lot to be desired. Its monolithic architecture makes it hard to test in small fragments. The code is littered with file-tracking code that's unrelated to the parsing itself. The parser performs some desugaring steps before the desugarer, for example in the case of pi-types or sigmas. All this complexity makes any changes to the parser fraught with unexpected challenges. Cleaning this up is going to take some time and we need to be mindful of doing it in a way that enables our other goals.
  • While it is not directly related to the parser, the desugaring steps and resugaring steps have their own problems that I think should be adressed as part of any efforts to improve the parser: generating provably parsable code. The issue is that there is no way to prove that a fragment of code can be parsed, desugared and resugared in a way that it can be parsed again. In code, we want to have the following lemma: (code : String) -> parse code >>= desugar >>= resugar >>= print ~~~ code where ~~~ is a notion of parsing equivalence where both sides must both parse or fail with a similar error. Additionally we want the internal trees to be the same: (tree : PTerm) -> tree >>= desugar >>= resugar === Just tree
  • One very visible area where parser, desugaring and resugaring fail is in location tracking. Often, location is lost, inaccurate or wrong, essentialy because location tracking needs to be carefully threaded by the programming at each of those steps and as a programmer there is no way to ensure that the assumption we make will hold after multiple passes of tree transformation, least of which are the desugarer and resugarer, but go much deeper with features like proof search, with clauses, elaborator reflection, and more. Because elaborated code is much harder to deal with, and further away from the parser, I think they are out of scope for this specific project. But desugared code should definitely be aware of the parser's capabilities.

@andrevidela andrevidela pinned this issue Aug 10, 2024
@Matthew-Mosior
Copy link
Contributor

Matthew-Mosior commented Aug 10, 2024

Depending on the timeline to re-write the parser (either all at once or in chunks), I feel that it might make the most sense/provide the best return on investment to spend the time/effort to utilize @stefan-hoeck's idris2-parser library. The potential performance improvement alone may be well worth the effort.

@jmanuel1
Copy link

There's also an archived attempt at a tree-sitter parser at https://github.com/gwerbin/tree-sitter-idris2 and an active one at https://github.com/kayhide/tree-sitter-idris that seems to be useful for my purposes (program reduction).

I'm most interested in a parser that can produce partial results, in case of an error, and that can produce a concrete syntax tree. I know tree-sitter does the former, but I haven't investigated its ability to do the latter well. These abilities would be useful for a code formatter, too.

@wenkokke
Copy link
Contributor

wenkokke commented Aug 11, 2024

I'd love to support a tree-sitter implementation, but this complicated the manner in which Idris is currently bootstrapped, since it would introduce a dependency on a C library.

It might be better to implement a parser library in Idris which produces incremental and streaming results and has error correction.

That said, the implementation effort would be significantly more complicated than simply reusing tree sitter.

@andrevidela
Copy link
Collaborator Author

With #3450 I started moving some of the logic that the parser performs into the desugarer, with three goals in mind:

  • Create a clear separation between surface syntax and desugared syntax. Eventually, the goal is to provide a faithful bidirectional mapping between those two trees, but first we need to even know what was in the file to begin with.
  • Simplify parsing such that we can reconstruct a BNF-style set of parse rules, those rules can then be repurposed for a future, more performant and more flexible parser
  • Automate as much as possible the location tracking code. This is quite an invasive change since it also changes TTImp and breaks elaboration scripts, that is why, for now, I've only focused on surface syntax.

This is a WIP with the limited time I have available but hopefully it moves the needle a little bit for the problems 4, 5, 6, and 7 listed above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants