Thoughts on cfront's potential improvements #84

ChAoSUnItY · 2023-11-11T07:35:33Z

Currently, cfront uses a scannerless parser with the IR emitter bound into it, and it contains ~3000 LOC. Based on my experience contributing to industrial-grade programming languages (V Lang in this case), shecc's frontend parser is hard to debug and hard for others to contribute to (even though shecc is meant to be educational).

Here's a list of my thoughts on improving the frontend of shecc:

1. Discard the scannerless parser and rewrite it as a separate lexer and parser.
2. Introduce an Abstract Syntax Tree for better IR emission and backend generation.
3. Separate the compilation phases into multiple files, so that potential contributors can learn shecc's architecture more easily.

If the suggestions above are accepted, the possible major changes would be:

- shecc would have more than 3 compilation phases.
- The code generation logic would have to be rewritten on top of the introduced AST.

jserv · 2023-11-11T09:04:08Z

> Currently, cfront uses a scannerless parser with the IR emitter bound into it, and it contains ~3000 LOC. Based on my experience contributing to industrial-grade programming languages (V Lang in this case), shecc's frontend parser is hard to debug and hard for others to contribute to (even though shecc is meant to be educational).

I have undertaken the task of reworking my earlier compiler project, AMaCC, with a primary focus on its educational value and potential extensibility. I wholeheartedly acknowledge the limitations of the current C front-end implementation, including the absence of a robust AST and proper modularization.

At the same time, @vacantron is dedicated to introducing the SSA-based IR following his initial efforts in register allocation. I want to make sure that we avoid any significant conflicts when it comes to the reworking of the C front-end. Could you please consider proposing a plan for submitting pull requests that involve minimal changes?

ChAoSUnItY · 2023-11-11T16:26:21Z

No problem. For the minimal changes, I would like to try separating the current parser into a lexer and a parser, so that lexical analysis and grammar parsing stay separate while all IR-related functions stay where they are in cfront. This only extracts the lexical analysis functionality from cfront, which then passes a token stream to the parser.
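
The split described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not shecc's actual code: `token_t`, `lexer_t`, `lexer_init`, `lexer_next`, and `count_tokens` are hypothetical names, and the token set is far smaller than what a real C front-end needs. The point is the interface: the lexer hands out one token per call, and the consumer never touches raw characters.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds; a real C front-end needs many more. */
typedef enum { T_eof, T_ident, T_number, T_punct } token_kind_t;

typedef struct {
    token_kind_t kind;
    char text[32]; /* lexeme, truncated if longer than 31 characters */
} token_t;

typedef struct {
    const char *cur; /* next unread character of the source */
} lexer_t;

void lexer_init(lexer_t *lx, const char *src)
{
    lx->cur = src;
}

/* Produce the next token on demand; the parser calls this repeatedly. */
token_t lexer_next(lexer_t *lx)
{
    token_t tok;
    size_t len = 0;

    tok.kind = T_eof;
    tok.text[0] = '\0';

    while (isspace((unsigned char) *lx->cur))
        lx->cur++;
    if (*lx->cur == '\0')
        return tok;

    if (isalpha((unsigned char) *lx->cur) || *lx->cur == '_') {
        tok.kind = T_ident;
        while (isalnum((unsigned char) *lx->cur) || *lx->cur == '_') {
            if (len < sizeof(tok.text) - 1)
                tok.text[len++] = *lx->cur;
            lx->cur++;
        }
    } else if (isdigit((unsigned char) *lx->cur)) {
        tok.kind = T_number;
        while (isdigit((unsigned char) *lx->cur)) {
            if (len < sizeof(tok.text) - 1)
                tok.text[len++] = *lx->cur;
            lx->cur++;
        }
    } else {
        /* Every other character is a one-character punctuator here. */
        tok.kind = T_punct;
        tok.text[len++] = *lx->cur++;
    }
    tok.text[len] = '\0';
    return tok;
}

/* Stand-in for a parser: it only ever sees tokens, never raw characters. */
int count_tokens(const char *src)
{
    lexer_t lx;
    int n = 0;

    lexer_init(&lx, src);
    while (lexer_next(&lx).kind != T_eof)
        n++;
    return n;
}
```

With this shape, `count_tokens("int x = 42;")` walks five tokens (`int`, `x`, `=`, `42`, `;`). Because only one `token_t` is alive at a time, the consumer does not need the whole token stream in memory.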

vacantron · 2023-11-11T17:12:10Z

Should we also rework the preprocessor? I think the way the current cfront handles compiler directives and macro expansion might cause some problems for your work.

jserv · 2023-11-11T17:51:30Z

> No problem. For the minimal changes, I would like to try separating the current parser into a lexer and a parser, so that lexical analysis and grammar parsing stay separate while all IR-related functions stay where they are in cfront. This only extracts the lexical analysis functionality from cfront, which then passes a token stream to the parser.

That sounds promising. You can track the ongoing migration to an SSA-based IR in pull request #85.

ChAoSUnItY · 2023-11-13T04:12:13Z

> Should we also rework the preprocessor? I think the way the current cfront handles compiler directives and macro expansion might cause some problems for your work.

Yes, I'm currently investigating it. The current implementation is highly unreasonable and may produce unexpected side effects. I'll try to handle preprocessor directives in the parser in a more consistent way (token-based parsing instead of manual string parsing).

I'll keep posting updates on the rework's progress here.

Edit 1: To avoid side effects entirely, I plan to extract the preprocessor from the lexer and parser completely: it would expand the source into another file, which is then read back into the lexer's token pipeline and passed to the parser.

Edit 2: After discussion with Jserv, shecc will not have a separate preprocessor. Instead, I'll focus on making directive handling consistent by using token-based parsing.

jserv · 2023-12-10T15:55:34Z

> I'll keep posting updates on the rework's progress here.

After successfully finishing #85 and #89, @vacantron will focus on improving the SSA IR and its related compilation process. Next, he plans to devote attention to #88, which involves an optimization phase based on SSA. In the meantime, it presents an ideal opportunity to revise and rework cfront.

jserv · 2023-12-16T05:54:09Z

> Discard the scannerless parser and rewrite it as a separate lexer and parser.

#92 is the starting point for this task.

jserv · 2024-01-07T07:12:01Z

I recently discovered a small-scale project laroc designed to create a C99 compiler for RISC-V. This project is intended to serve as a reference for improving the frontend and code generation aspects of C compilers.

idoleat · 2024-01-16T19:50:14Z

> Edit 2: After discussion with Jserv, shecc will not have a separate preprocessor. Instead, I'll focus on making directive handling consistent by using token-based parsing.

I would like to know the reason for not having a separate preprocessor. The current preprocessing logic is mixed into both the lexer and the parser as special cases. If preprocessing were separated, the implementation might benefit from fewer states in the lexer and fewer additional conditions in the parser. New features (if planned), such as token concatenation in macros (##), might also be easier to add with separate preprocessing. Other parsing algorithms (such as a more error-resilient one) could be tried out easily as well, since the parser would no longer need to deal with preprocessing.

jserv · 2024-01-17T01:48:43Z

> I would like to know the reason for not having a separate preprocessor. The current preprocessing logic is mixed into both the lexer and the parser as special cases. If preprocessing were separated, the implementation might benefit from fewer states in the lexer and fewer additional conditions in the parser. New features (if planned), such as token concatenation in macros (##), might also be easier to add with separate preprocessing. Other parsing algorithms (such as a more error-resilient one) could be tried out easily as well, since the parser would no longer need to deal with preprocessing.

This project draws inspiration from AMaCC, which in turn was influenced by the remarkable c4. All three projects share a common theme of minimalism, emphasizing self-bootstrapping without the need for external tools. This is precisely why this project eschews the use of separate assemblers and linkers, despite being a cross-compiler. Unlike mature compilers like GCC and LLVM, where the C preprocessor (cpp) is a distinct program, in our project cpp is integrated into the lexer/parser. This approach aligns with our minimalist design philosophy. While this integration adds complexity to the existing C front-end, I believe that the benefits of a more unified design principle justify this complexity.

ChAoSUnItY · 2024-02-24T18:06:54Z

As of the merge of #111, the work on cfront is considered complete for now. Still, I will leave this issue open for the following reasons:

1. The viability of separating the parser and lexer is doubtful, since memory usage will increase if tokenization of the source file is fully completed before syntactic analysis begins and the token information is stored in struct form.
2. As mentioned above, the tokenization strategy may have to change heavily, because cpp (the C preprocessor) and the C language itself call for different parsing strategies. More precisely, the newline (\n or \r) or backslash (\) character may need to be treated as a valid token for directives to be parsed successfully. Additionally, the token aliasing strategy requires earlier changes to be in place (see Enhance and cleanup lexer-parser interface #107).
3. Preprocessor syntax validation, specifically detection of unused tokens after an expression, is unimplemented because of reason 2.

In general, these issues require additional investigation before they can be resolved.
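
To make reason 2 more concrete, here is a minimal sketch of a lexer that treats a newline as a token only while a directive is being read, and that always skips a backslash followed by a newline as a line continuation. Everything in it is an assumption made for illustration: `kind_t`, `next_token`, the `newline_is_token` flag, and `directive_body_length` are hypothetical names and do not reflect shecc's actual lexer-parser interface.

```c
#include <ctype.h>

/* Hypothetical token kinds for this sketch only. */
typedef enum { T_eof, T_newline, T_hash, T_ident, T_number, T_other } kind_t;

/* Read one token from *src and advance *src past it. Newlines are plain
 * whitespace unless newline_is_token is non-zero, which a parser would set
 * while it is inside a preprocessor directive. A backslash directly followed
 * by a newline is a line continuation and is skipped in both modes. */
kind_t next_token(const char **src, int newline_is_token)
{
    const char *p = *src;
    kind_t kind;

    for (;;) {
        if (p[0] == '\\' && p[1] == '\n')
            p += 2; /* continuation with LF line ending */
        else if (p[0] == '\\' && p[1] == '\r' && p[2] == '\n')
            p += 3; /* continuation with CRLF line ending */
        else if (*p == '\n' && newline_is_token)
            break; /* the newline itself becomes a token */
        else if (isspace((unsigned char) *p))
            p++;
        else
            break;
    }

    if (*p == '\0') {
        kind = T_eof;
    } else if (*p == '\n') {
        kind = T_newline;
        p++;
    } else if (*p == '#') {
        kind = T_hash;
        p++;
    } else if (isalpha((unsigned char) *p) || *p == '_') {
        kind = T_ident;
        while (isalnum((unsigned char) *p) || *p == '_')
            p++;
    } else if (isdigit((unsigned char) *p)) {
        kind = T_number;
        while (isdigit((unsigned char) *p))
            p++;
    } else {
        kind = T_other;
        p++;
    }

    *src = p;
    return kind;
}

/* Count the tokens after "#name" up to the newline that ends the directive.
 * Returns -1 if src does not start with a directive. */
int directive_body_length(const char *src)
{
    int n = 0;
    kind_t k;

    if (next_token(&src, 0) != T_hash)
        return -1;
    if (next_token(&src, 1) != T_ident)
        return -1;
    while ((k = next_token(&src, 1)) != T_newline && k != T_eof)
        n++;
    return n;
}
```

Under these assumptions, a check for stray tokens after a directive (reason 3) reduces to asking whether the token after the expected body is `T_newline`. For instance, `directive_body_length("#endif\n")` is 0, so any non-zero result for `#endif` would indicate trailing tokens.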

jserv assigned ChAoSUnItY Nov 11, 2023

jserv mentioned this issue Dec 7, 2023: Do we need to change how shecc evaluate expression? #28 (Closed)

This was referenced Jan 16, 2024: Migrate preprocessor directive handling #106 (Merged), Enhance and cleanup lexer-parser interface #107 (Merged)
