A fun use of the Ragel State Machine Compiler: a function that parses a line into int argc, char *argv[].
It all started when I needed a buildargv function that parses a string so the result can then be passed to
int main (int argc, char *argv[]) { body }
Well, I thought, surely there must be somewhere to borrow one from, we'll find it in no time... And I didn't.
Not that I found nothing at all. For example, there is https://github.com/gcc-mirror/gcc/blob/master/libiberty/argv.c (GPLv2 is always good), but I was not ready to take on such obligations right away. There is definitely such a function in bash (GPLv3 is even better). zsh? Go and find it (I did find it... and did not want it).
In short, I did not find what I wanted, and I did not like what I found. In the end I am entitled to that: I am doing this for myself anyway, for the fun of the process.
I did not want to write this the conventional way at all; the very thought upset me.
So, meet the Ragel State Machine Compiler.
Tools
- gcc;)
- ragel
- make
- lcov
- libcheck
The project can be found here: JOYFUL CMDLINE PARSER WRITTEN IN RAGEL
Problem statement
The input is an arbitrary string. The task is to turn it into an array of arguments separated by spaces or tabs, where:
- Any character following the escape character \ must be ignored, that is, never treated as a separator or a quote.
- Any characters between two double quotes " or two single quotes ' must be considered a single element.
- If a ' or " is left unclosed, an error must be returned.
There are not many conditions overall, and Ragel suits this task well.
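To make the three rules concrete before turning to Ragel, here is a minimal hand-written sketch in plain C. It is not the project's code and not a port of the Ragel machine: the names split_line, free_args, copy_span and the SPLIT_* codes are made up for illustration, and I assume that quotes and backslashes stay in the resulting tokens. It may also differ from the machine in edge cases, such as a quote in the middle of a word or a lone backslash at the end of the line.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result codes, not the project's BUILDARGV_* values. */
enum { SPLIT_OK = 0, SPLIT_EQUOTE = -1, SPLIT_ENOMEM = -2 };

/* Frees an array of n strings produced by split_line. */
static void free_args(char **v, int n)
{
    for (int i = 0; i < n; i++)
        free(v[i]);
    free(v);
}

/* Returns a NUL-terminated copy of len bytes starting at s, or NULL. */
static char *copy_span(const char *s, size_t len)
{
    char *copy = malloc(len + 1);
    if (copy != NULL) {
        memcpy(copy, s, len);
        copy[len] = '\0';
    }
    return copy;
}

/* Splits line on spaces and tabs, honouring \ escapes and '...' / "..."
 * sections. On success stores the array in *out and its length in *count. */
static int split_line(const char *line, char ***out, int *count)
{
    char **v = NULL;
    int n = 0;
    size_t i = 0;

    assert(line != NULL && out != NULL && count != NULL);
    *out = NULL;
    *count = 0;

    for (;;) {
        while (line[i] == ' ' || line[i] == '\t')
            i++;                          /* skip separators */
        if (line[i] == '\0')
            break;

        size_t start = i;
        char quote = 0;                   /* open quote character, 0 if none */
        while (line[i] != '\0') {
            char c = line[i];
            if (c == '\\' && line[i + 1] != '\0') {
                i += 2;                   /* escaped character is never special */
                continue;
            }
            if (quote != 0) {
                if (c == quote)
                    quote = 0;            /* closing quote */
            } else if (c == '\'' || c == '"') {
                quote = c;                /* opening quote */
            } else if (c == ' ' || c == '\t') {
                break;                    /* unquoted separator ends the token */
            }
            i++;
        }
        if (quote != 0) {                 /* input ended inside a quote */
            free_args(v, n);
            return SPLIT_EQUOTE;
        }

        char **grown = realloc(v, ((size_t)n + 1) * sizeof *grown);
        if (grown == NULL) {
            free_args(v, n);
            return SPLIT_ENOMEM;
        }
        v = grown;
        v[n] = copy_span(line + start, i - start);
        if (v[n] == NULL) {
            free_args(v, n);
            return SPLIT_ENOMEM;
        }
        n++;
    }
    *out = v;
    *count = n;
    return SPLIT_OK;
}
```

Even in this small form the quoting and escaping logic is tangled into one loop, which is roughly the kind of code the state machine description is meant to replace.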
Explained Implementation
Declare a machine with the name "buildargv" and ask Ragel to place its data at the beginning of the file (5.8.1 Write Data).
%%{ machine buildargv; write data; }%%
Next, we declare a lineElement machine, which in turn consists of a union (2.5.1 Union) of two machines: arg and whitespace.
lineElement = arg >start_arg %end_arg | whitespace;
main := lineElement**;
On entering and on leaving the arg machine, the actions start_arg and end_arg are executed, respectively.
action start_arg {
    argv_s = p;
}
action end_arg {
    nargv = (char**)realloc((*argv), (argc_ + 1)*sizeof(char*));
    (*argv) = nargv;
    (*argv)[argc_] = strndup(argv_s, p - argv_s);
    argc_++;
}
The job of start_arg is to save the position of the character at entry, and the job of end_arg is to append a new element to the argv array on a successful exit from the arg machine.
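The append step performed by end_arg can also be written as a standalone C helper that checks for allocation failure. This is only a sketch: push_arg is a hypothetical name, not a function from the project, and it copies with malloc and memcpy rather than strndup, so it copies exactly len bytes instead of stopping at an embedded NUL.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: appends a copy of the len bytes at start to the
 * array *argv, which currently holds *argc strings. Returns 0 on success
 * and -1 on allocation failure, leaving *argv and *argc unchanged. */
static int push_arg(char ***argv, int *argc, const char *start, size_t len)
{
    assert(argv != NULL && argc != NULL && start != NULL);

    /* Copy the token first, so a failure here touches nothing else. */
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, start, len);
    copy[len] = '\0';

    /* Grow the array through a temporary so the old pointer is not lost
     * if realloc fails. */
    char **grown = realloc(*argv, ((size_t)*argc + 1) * sizeof *grown);
    if (grown == NULL) {
        free(copy);
        return -1;
    }
    grown[*argc] = copy;
    *argv = grown;
    (*argc)++;
    return 0;
}
```

Inside a Ragel action such a helper would be called with the saved start position and the length p - argv_s; how a failure is then reported to the caller is left open here.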
Now let's take a closer look at arg.
arg = '\''> { fcall squote; } | '"'>{ fcall dquote; } | ( '\\'>{fcall skip;} | ^[ \t"'\\] )+;
It is a union of three machines: ', " and (\ | ^[ \t"'\]). The last one is in turn a union of \ and ^[ \t"'\].
When we meet the character ' we call squote, when we meet " we call dquote, and when the current character is \ we call skip, which skips whatever character follows it. Any character other than 0x20 (space), 0x09 (tab), ', " or \ is accepted as part of the argument.
It remains to consider a very small part:
skip := any @{ fret; };
dquote := ( '\\'>{ fcall skip; } | ^[\\] )+ :> ["] @{ fret; } @err(dquote_err);
squote := ( '\\'>{ fcall skip; } | ^[\\] )+ :> ['] @{ fret; } @err(squote_err);
We have already dealt with skip, and ^['\\] should raise no questions either. The :> operator is Entry-Guarded Concatenation (4.2 Guarded Operators that Encapsulate Priorities). Its meaning here is that the machine ( '\\'>{ fcall skip; } | ^['\\] )+ stops executing as soon as the machine ["] leaves its start state, that is, as soon as the closing quote is read.
And finally, if the line ends while a quote is still open, the actions dquote_err and squote_err report the error and set the corresponding error code.
action dquote_err {
    ret = -1;
    errsv = BUILDARGV_EDQUOTE;
}
action squote_err {
    ret = -1;
    errsv = BUILDARGV_ESQUOTE;
}
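For readers new to fcall and fret, the call-and-return behaviour can be modelled by hand with an explicit stack of states (Ragel itself expects the host code to provide stack and top variables when fcall is used). The sketch below is a simplified model in plain C, not the code Ragel generates: check_quotes, the ST_* states and the CHECK_* codes are hypothetical names, and it only validates quoting instead of building argv.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states mirroring main, squote, dquote and skip. */
enum state { ST_MAIN, ST_SQUOTE, ST_DQUOTE, ST_SKIP };

/* Hypothetical result codes, not the project's BUILDARGV_* values. */
enum { CHECK_OK = 0, CHECK_EDQUOTE = -1, CHECK_ESQUOTE = -2 };

/* Walks the string once and reports whether every quote is closed. */
static int check_quotes(const char *s)
{
    enum state stack[4];  /* deepest nesting is main -> quote -> skip */
    int top = 0;
    enum state cs = ST_MAIN;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        char c = *s;
        switch (cs) {
        case ST_SKIP:
            /* Like "skip := any @{ fret; }": consume one character of any
             * kind and return to whoever called us. */
            cs = stack[--top];
            break;
        case ST_MAIN:
            /* Like fcall: remember where to return, then jump. */
            if (c == '\\')      { stack[top++] = cs; cs = ST_SKIP; }
            else if (c == '\'') { stack[top++] = cs; cs = ST_SQUOTE; }
            else if (c == '"')  { stack[top++] = cs; cs = ST_DQUOTE; }
            break;
        case ST_SQUOTE:
        case ST_DQUOTE:
            if (c == '\\') {
                stack[top++] = cs;
                cs = ST_SKIP;
            } else if (c == (cs == ST_SQUOTE ? '\'' : '"')) {
                cs = stack[--top];        /* closing quote: fret */
            }
            break;
        }
    }

    /* End of input: a pending skip does not matter, an open quote does. */
    if (cs == ST_SKIP)
        cs = stack[top - 1];
    if (cs == ST_DQUOTE)
        return CHECK_EDQUOTE;
    if (cs == ST_SQUOTE)
        return CHECK_ESQUOTE;
    return CHECK_OK;
}
```

The end-of-input check plays the role that the @err actions play in the Ragel version: the error is detected when the input runs out while the machine is still inside a quoted section.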
Code generation is carried out by the command:
ragel -e -L -F0 -o buildargv.c buildargv.rl
A list of test lines can be found in test_cmdline.c.
Conclusion
The problem is solved.
Was it faster? I doubt it. Is it clearer? Only if you are a Ragel expert.
I make no claim to having the final word, and I will be grateful for constructive comments on the Ragel code.
Material List:
[^1]: Adrian Thurston. Ragel State Machine Compiler.