Iury O. G. Figueiredo edited this page Jul 6, 2020 · 25 revisions

Welcome to the crocs docs!

This wiki describes the features and the benefits of using crocs.

Introduction

In crocs, regexes become classes, and you use these classes to describe your patterns. A regex pattern written in Python this way is called a yregex.

Once a pattern is constructed, the structure of Python classes can be compiled to a regex string. Crocs can also generate possible hits for the pattern, that is, sample strings that it matches, which gives you a good idea of what your pattern will accept.
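One way to picture the idea of classes compiling to a regex string is the toy sketch below. The classes `Wild` and `Join` and the method `to_regex` are hypothetical names invented for this illustration; they are not the crocs API and say nothing about how crocs is implemented.

```python
import re

# Toy illustration only: these classes are hypothetical, NOT the crocs API.
class Wild:
    """Stands for the regex wildcard '.'."""
    def to_regex(self):
        return '.'

class Join:
    """Concatenates string literals and other nodes into one pattern."""
    def __init__(self, *parts):
        self.parts = parts

    def to_regex(self):
        # Plain strings are escaped; other nodes render themselves.
        return ''.join(
            re.escape(p) if isinstance(p, str) else p.to_regex()
            for p in self.parts
        )

pattern = Join('a', Wild(), 'b')
regex = pattern.to_regex()
print(regex)
print(bool(re.fullmatch(regex, 'adb')))
```

Each node only knows how to render itself, so larger patterns are built by nesting nodes.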

Regex to Yregex

The xmake function converts a regex to a yregex. It parses a raw regex string and builds a yregex structure that can be serialized to Python code.

from crocs.xparser import xmake
yregex = xmake(r'(a.b)')
yregex.test()

That gives you:

Input: 'adb'
Regex: (a.b)
Group dict: {}
Group 0: adb
Groups: ('adb',)
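The reported values can be cross-checked without crocs, using only the standard `re` module. This sketch takes the regex string and input from the output above; it is an independent check, not part of what `test()` does internally.

```python
import re

# Regex string and input copied from the output above.
match = re.fullmatch(r'(a.b)', 'adb')

print(match.group(0))      # text matched by the whole pattern
print(match.groups())      # captured groups
print(match.groupdict())   # named groups (none here)
```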

To see the Python code that would produce this yregex, call its mkcode method.

print(yregex.mkcode())

That would output:

from crocs.regex import Group, X
from crocs.core import RegexStr
x0 = X()
group0 = Group('a', x0, 'b')

Joining Patterns

A yregex pattern is a structure of class instances grouped together. The Pattern class joins patterns to build a single one.

For instance, a plain string is itself a pattern:

from crocs.regex import Pattern
e = Pattern('a', 'b', 'c', 'd')
e.test()
e.hits()

The Pattern constructor accepts regex classes and glues them together to form a master pattern.

That gives:

from crocs.regex import Pattern
e = Pattern('a', 'b', 'c', 'd')
e.test()
Regex: abcd
Input: abcd
Group dict: {}
Group 0: abcd
Groups: ()
e.hits()
Match with:
 abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd

Wildcard

In crocs, the regex wildcard character becomes a class, X. It can be combined with other existing classes to produce patterns.

from crocs.regex import Pattern, X

e = Pattern('a', X(), 'b')
e.test()
e.hits()

That produces:

Regex: a.b
Input: a.b
Group dict: {}
Group 0: a.b
Groups: ()
Match with:
 a9b alb a|b aVb apb aqb arb aAb a[b a;b
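As a sanity check, the hits listed above can be run against the compiled regex `a.b` with Python's standard `re` module. The sketch also shows one detail of the wildcard in Python that the output does not cover: by default `.` does not match a newline.

```python
import re

regex = re.compile(r'a.b')

# Hits copied from the output above.
hits = 'a9b alb a|b aVb apb aqb arb aAb a[b a;b'.split()
all_match = all(regex.fullmatch(h) for h in hits)

# '.' skips newlines unless the re.DOTALL flag is set.
newline_default = regex.fullmatch('a\nb')
newline_dotall = re.fullmatch(r'a.b', 'a\nb', flags=re.DOTALL)

print(all_match, newline_default is None, newline_dotall is not None)
```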

Include/Sequence

Regex sequences (character ranges) are mapped to the Seq class. It receives two arguments that delimit the start and end of the desired sequence. Include wraps the sequence in a character set, as the resulting regex shows.

from crocs.regex import Pattern, Include, Seq

e = Pattern('x', Include(Seq('0', '9')))
e.test()
e.hits()

That would give you:

Regex: x[0-9]
Input: x8
Group dict: {}
Group 0: x8
Groups: ()
Match with:
 x8 x4 x6 x1 x0 x0 x2 x7 x4 x2

A standalone Include makes this clearer:

from crocs.regex import Include, Seq

e = Include(Seq('a', 'z'))
e.test()
e.hits()

Which would output:

Regex: [a-z]
Input: t
Group dict: {}
Group 0: t
Groups: ()
>>> e.hits()
Match with:
 v o g q v p t x l f
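To make the meaning of the two delimiters concrete, the sketch below uses the standard `re` module (not crocs) to confirm that the compiled range `[0-9]` accepts exactly the characters from `'0'` through `'9'`, both ends included.

```python
import re

digit = re.compile(r'[0-9]')

# Every ASCII character accepted by the range.
matched = [chr(c) for c in range(128) if digit.fullmatch(chr(c))]

# The characters between the two delimiters, inclusive.
expected = [chr(c) for c in range(ord('0'), ord('9') + 1)]

print(matched == expected)
```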

Exclude/Sequence

There is also the Exclude class, which represents the negated character set [^...].

from crocs.regex import Pattern, Include, Exclude, Seq

e = Pattern(Exclude('abc'), Include(Seq('0', '9')))
e.test()
e.hits()

That would give you:

Input: '45'
Regex: [^abc][0-9]
Group dict: {}
Group 0: 45
Groups: ()
Match with:
 e0 n0 "3 ?0 o0 -0 <1
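The same kind of check works for the negated set. This sketch uses only the standard `re` module with the regex and hits from the output above. It also shows that in Python a negated set accepts any character not listed, including digits and newlines.

```python
import re

regex = re.compile(r'[^abc][0-9]')

# Hits copied from the output above.
hits = 'e0 n0 "3 ?0 o0 -0 <1'.split()
all_match = all(regex.fullmatch(h) for h in hits)

excluded = regex.fullmatch('a5')      # 'a' is in the excluded set
digit_first = regex.fullmatch('45')   # a digit is not excluded
newline_first = regex.fullmatch('\n5')

print(all_match, excluded is None)
```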

Repeat

The Repeat class describes the number of times a given pattern has to occur in order to match.

The example below clarifies the usage.

from crocs.regex import Pattern, Repeat

e = Pattern('a', Repeat('b'), Repeat('c'))
e.test()
e.hits()

Would output:

Input: 'abccccccc'
Regex: ab{0,}c{0,}
Group dict: {}
Group 0: abccccccc
Groups: ()
>>> e.hits()
Match with:
 abbbcccccc abbbbcccc abbbbcc abbbbbcccc abbbbbbbcccc ab accc
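The `{0,}` quantifier in the compiled regex means "zero or more", the same as `*`. The sketch below checks this with the standard `re` module on the hits above plus two extra strings (`a` and `acb`) added here for illustration.

```python
import re

braces = re.compile(r'ab{0,}c{0,}')
star = re.compile(r'ab*c*')

# Hits from the output above, plus 'a' and 'acb' added for this check.
samples = ('abbbcccccc abbbbcccc abbbbcc abbbbbcccc '
           'abbbbbbbcccc ab accc a acb').split()

same = all(
    bool(braces.fullmatch(s)) == bool(star.fullmatch(s)) for s in samples
)
print(same)
```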

This example shows how to repeat a string longer than one character.

from crocs.regex import Pattern, Repeat, Group

group = Group('foo')
e = Pattern('x', Repeat(group))
e.test()
e.hits()

Output:

Input: 'xfoofoofoofoofoo'
Regex: x(foo){0,}
Group dict: {}
Group 0: xfoofoofoofoofoo
Groups: ('foo',)
>>> e.hits()
Match with:
 xfoofoofoo xfoofoofoofoofoofoo xfoofoofoo xfoofoofoofoo xfoo xfoofoofoo xfoo

The Repeat class accepts a single yregex structure, which cannot be an instance of Pattern or Any. It can be a str, but only one of length 1. The restriction avoids ambiguity about what the quantifier applies to; to repeat a longer string, wrap it in a Group as shown above.
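The ambiguity is presumably the usual one with regex quantifiers: a quantifier binds only to the item immediately before it. The sketch below illustrates that reading with the standard `re` module; it is an interpretation of the rule, not something taken from the crocs source.

```python
import re

# Without a group, {0,} applies to the last 'o' only.
ungrouped = re.compile(r'xfoo{0,}')
# With a group, {0,} applies to the whole 'foo'.
grouped = re.compile(r'x(foo){0,}')

print(bool(ungrouped.fullmatch('xfoooo')))   # 'o' repeated
print(bool(ungrouped.fullmatch('xfoofoo')))  # 'foo' repeated
print(bool(grouped.fullmatch('xfoofoo')))
print(bool(grouped.fullmatch('xfoooo')))
```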

Group

The Group class groups patterns together, making it possible to apply other regex operators to them and build new patterns. It also captures the text matched by the group so that it can be retrieved later.

from crocs.regex import Pattern, Group, X

e = Pattern('a', Group('b', X()))
e.test()
e.hits()

That would output:

Regex: a(b.)
Input: ab@
Group dict: {}
Group 0: ab@
Groups: ('b@',)
>>> e.hits()
Match with:
 abf abi ab: abm aba abe ab4 abl abs ab;

This example shows a group reference (backreference).

from crocs.regex import Pattern, Repeat, Group, X

group = Group('x', X())
repeat = Repeat(group, 1, 4)
e = Pattern('x', repeat, group)
e.test()
e.hits()

Would output:

Input: 'xxzxzxzxzxz'
Regex: x(x.){1,4}\1
Group dict: {}
Group 0: xxzxzxzxzxz
Groups: ('xz',)

Notice that the numeric reference to the group is generated automatically from the context. You just need to reuse the group variable in later statements or arguments.

The first reference to a group variable corresponds to the group definition itself; each later reference is serialized as \num, where num is the group index within the regex.
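The behavior of the generated `\1` can be checked with the standard `re` module. This sketch uses the regex and input from the output above, plus one failing input (`xxzxy`) added here to show that a backreference must repeat the captured text, not merely match the same sub-pattern.

```python
import re

regex = re.compile(r'x(x.){1,4}\1')

# Input copied from the output above.
match = regex.fullmatch('xxzxzxzxzxz')
print(match.groups())

# 'xy' fits the sub-pattern 'x.' but differs from the captured 'xz'.
mismatch = regex.fullmatch('xxzxy')
print(mismatch)
```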

NamedGroup

Named groups are useful for keeping track of specific patterns that were matched. In crocs, you can reference a named group in another regex pattern, which also makes your regexes easier to debug.

from crocs.regex import Pattern, NamedGroup, X
e = Pattern('x', NamedGroup('foo', X()))
e.test()
e.hits()

Would output:

Regex: x(?P<foo>.)
Input: xo
Group dict: {'foo': 'o'}
Group 0: xo
Groups: ('o',)
>>> e.hits()
Match with:
 xK xS xt x{ x7 x3 xv xE xu xU

References to named groups work the same way as references to numeric groups.

from crocs.regex import Pattern, NamedGroup, X

e0 = NamedGroup('beta', 'X', X(), 'B')
e1 = Pattern('um', e0, 'dois', e0, 'tres', e0)

e1.test()
e1.hits()

Would output:

Input: 'umXdBdoisXdBtresXdB'
Regex: um(?P<beta>X.B)dois(?P=beta)tres(?P=beta)
Group dict: {'beta': 'XdB'}
Group 0: umXdBdoisXdBtresXdB
Groups: ('XdB',)
>>> e1.hits()
Match with:
 umX8BdoisX8BtresX8B umXCBdoisXCBtresXCB umX>BdoisX>BtresX>B umX1BdoisX1BtresX1B umX~BdoisX~BtresX~B umX=BdoisX=BtresX=B umXMBdoisXMBtresXMB
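The named reference can be verified with the standard `re` module. The sketch writes the regex out in Python's named-group syntax, `(?P<beta>...)` for the definition and `(?P=beta)` for each later reference, and adds one failing input (with `XeB` in the middle) to show that every reference must repeat the captured text.

```python
import re

regex = re.compile(r'um(?P<beta>X.B)dois(?P=beta)tres(?P=beta)')

match = regex.fullmatch('umXdBdoisXdBtresXdB')
print(match.groupdict())
print(match.groups())

# The middle part differs from the captured 'XdB', so there is no match.
mismatch = regex.fullmatch('umXdBdoisXeBtresXdB')
print(mismatch)
```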


ConsumeNext

from crocs.regex import Pattern, ConsumeNext, X

e = ConsumeNext(Pattern('a', X(), 'b'), 'def')
e.test()
e.hits()

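The ConsumeNext constructor accepts the keyword neg, which selects between a positive and a negative lookbehind assertion. As the regex below shows, the first argument becomes the lookbehind, (?<=...), and the second argument is what gets matched.

Note also that Pattern instances can be nested inside other classes to build patterns.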

That would output:

Regex: (?<=a.b)def
Input: ambdef
Group dict: {}
Group 0: def
Groups: ()
Match with:
 aUbdef a@bdef ambdef a=bdef a&bdef ambdef a0bdef aMbdef a1bdef aIbdef
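The regex above is a lookbehind: the text before `def` is checked but is not part of the match, which is why Group 0 is only `def`. The sketch below confirms this with the standard `re` module; the failing input `abcdef` is added here for illustration.

```python
import re

regex = re.compile(r'(?<=a.b)def')

# Input copied from the output above.
match = regex.search('ambdef')
print(match.group(0), match.span())

# 'abc' does not fit 'a.b', so the assertion fails.
mismatch = regex.search('abcdef')
print(mismatch)
```

Python's `re` requires lookbehind patterns to have a fixed width, which `a.b` satisfies.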

ConsumeBack

ConsumeBack builds a lookahead assertion. It also accepts a neg argument; with neg=True, as shown below, it produces a negative lookahead.

from crocs.regex import Pattern, ConsumeBack

e = ConsumeBack('Isaac ', 'Asimov', neg=True)
e.test()
e.hits()

That would output:

Regex: Isaac\ (?!Asimov)
Input: Isaac e<&c2)
Group dict: {}
Group 0: Isaac 
Groups: ()
Match with:
 Isaac )nPSNn Isaac e>}@cC Isaac R(+SHX Isaac 5RK~^X 
Isaac +2'b0- Isaac QWCa%k Isaac $ZDc9j
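The `(?!Asimov)` in the regex above is a negative lookahead: `Isaac ` matches only when it is not followed by `Asimov`, and the text after it is not consumed. A quick check with the standard `re` module, using one input from the output above and one counterexample added here:

```python
import re

regex = re.compile(r'Isaac\ (?!Asimov)')

# Input copied from the output above.
match = regex.search('Isaac e<&c2)')
print(repr(match.group(0)))

# Followed by 'Asimov', so the negative lookahead rejects it.
rejected = regex.search('Isaac Asimov')
print(rejected)
```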

OneOrMore

The OneOrMore class corresponds to the regex operator +.

from crocs.regex import Pattern, X, OneOrMore
repeat = OneOrMore('a')
repeat.test()

Would output:

Input: 'aaaaaa'
Regex: a+
Group dict: {}
Group 0: aaaaaa
Groups: ()

OneOrMore works like Repeat: it does not accept strings longer than one character, nor instances of Any or Pattern. As with Repeat, the restriction avoids ambiguity.

ZeroOrMore

The ZeroOrMore class corresponds to the regex operator *.

from crocs.regex import Pattern, X, ZeroOrMore
repeat = ZeroOrMore('a')
repeat.test()

That would output:

Input: 'aaa'
Regex: a*
Group dict: {}
Group 0: aaa
Groups: ()

OneOrZero

The OneOrZero class corresponds to the regex operator ?.

from crocs.regex import Pattern, X, OneOrZero
repeat = OneOrZero('a')
repeat.test()

That would output:

Input: 'a'
Regex: a?
Group dict: {}
Group 0: a
Groups: ()

Note: The classes Repeat, OneOrZero, OneOrMore and ZeroOrMore all accept a greedy boolean argument that controls whether the operator behaves greedily.
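In regex terms, a non-greedy operator is normally written by appending `?` to the quantifier. The sketch below shows the difference with the standard `re` module. That crocs serializes a non-greedy operator to this `?` suffix is an assumption here; the text above does not state it.

```python
import re

text = 'aaaaaa'

# Greedy: take as many characters as possible.
greedy = re.match(r'a+', text).group(0)

# Non-greedy (lazy): take as few characters as possible.
lazy_plus = re.match(r'a+?', text).group(0)
lazy_star = re.match(r'a*?', text).group(0)
lazy_opt = re.match(r'a??', text).group(0)

print(repr(greedy), repr(lazy_plus), repr(lazy_star), repr(lazy_opt))
```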