Bug 40648

Summary: Regular expression support in comparison for <test>
Product: fontconfig Reporter: Akira TAGOH <akira>
Component: library Assignee: fontconfig-bugs
Status: RESOLVED MOVED QA Contact: Behdad Esfahbod <freedesktop>
Severity: enhancement    
Priority: medium CC: freedesktop
Version: 2.8   
Hardware: Other   
OS: All   

Description Akira TAGOH 2011-09-06 00:41:39 UTC
It would be useful to support regular expressions for comparison. This could possibly help to fix Bug#35118 with:

<match>
  <test name="lang" compare="regex">
    <string>pa.*</string>
  </test>
  ...
</match>

It might perhaps also help with Bug#28491, by matching a pattern against the filename only, and with Bug#13416, by merging the subfamilies.
Comment 2 Akira TAGOH 2011-09-06 01:14:42 UTC
Tested with:


<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
        <match>
                <test name="lang" compare="regex">
                        <string>ja.*</string>
                </test>
                <edit name="family" mode="prepend">
                        <string>DejaVu Sans</string>
                </edit>
        </match>
</fontconfig>

$ FC_DEBUG=4 fc-match :lang=ja
...
Add Subst match
        pattern any lang Regex "ja.*"
edit
        Edit family Prepend "DejaVu Sans";
...
FcConfigSubstitute test pattern any lang Regex "ja.*"
Substitute match
        pattern any lang Regex "ja.*"
edit
        Edit family Prepend "DejaVu Sans";

Prepend list before  "sans-serif"(w)
Prepend list after  "DejaVu Sans"(w) "sans-serif"(w)
FcConfigSubstitute editPattern has 2 elts (size 16)
        family: "DejaVu Sans"(w) "sans-serif"(w)
        lang: ja(s)
...

$ FC_DEBUG=4 fc-match :lang=ja-jp
...
FcConfigSubstitute test pattern any lang Regex "ja.*"
Substitute match
        pattern any lang Regex "ja.*"
edit
        Edit family Prepend "DejaVu Sans";

Prepend list before  "sans-serif"(w)
Prepend list after  "DejaVu Sans"(w) "sans-serif"(w)
FcConfigSubstitute editPattern has 2 elts (size 16)
        family: "DejaVu Sans"(w) "sans-serif"(w)
        lang: ja-jp(s)
Comment 3 Behdad Esfahbod 2011-09-12 10:08:03 UTC
No, this is not the right approach.  What's not working right now?  Lang testing is not string testing.
Comment 4 Akira TAGOH 2011-09-12 21:24:33 UTC
(In reply to comment #3)
> No, this is not the right approach.  What's not working right now?  Lang
> testing is not string testing.

Right. Strictly speaking, though, the above patch doesn't test the string; it adds the lang names according to the result of the regexp and checks whether the pattern matches the resulting string set. That behavior is more intuitive IMHO.
Comment 5 Akira TAGOH 2011-09-12 21:59:22 UTC
Maybe that wasn't enough of an explanation...

This feature would give us an easy way to test multiple lang names. It doesn't provide the detailed comparison between FcLangSets, which is modified a bit for special cases, but it would provide similar functionality when creating an FcLangSet from the string.
I'm not sure this is a really useful example, but as a possible use: if there were a requirement to apply something to CJK only, we could do:

<match>
  <test name="lang" compare="regex">
    <string>zh|ja|ko</string>
  </test>
  ....
</match>

According to Bug#33644, there is no smart way to do that in fontconfig so far, other than having 3 different <match/> rules, which isn't really smart.

FWIW the previous comment was somewhat of a false alarm: when the FcLangSet contains an invalid lang name, it would be the same as comparing the string, as you said. I'm not quite sure what the "extra" field in FcLangSet is used for. That could be improved if one could create a strict FcLangSet when building the pattern.
Comment 6 Akira TAGOH 2013-01-11 08:20:57 UTC
Another idea for a regexp use case is:

<match>
  <test name="psname" compare="regex">
    <string>(.*)\-(UniJIS\-UTF8\-H)$</string>
  </test>
  <edit name="family" mode="regex_replace">
    <string>\1</string>
  </edit>
  <edit name="pscmap" mode="regex_replace">
    <string>\2</string>
  </edit>
  <edit name="lang" mode="assign">
    <langset><string>ja</string></langset>
  </edit>
</match>

We could have a bunch of rules for CMaps that determine the family name and the lang from the psname in the pattern.
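
To make the intended capture-group behavior concrete, here is a minimal sketch in Python rather than fontconfig syntax. It only illustrates how the pattern above would split a psname into the pieces that `\1` and `\2` refer to; the function name `split_psname` and the sample font name are assumptions for illustration, and the `regex_replace` edit mode itself is a proposal in this report, not an existing fontconfig feature.

```python
import re

# Pattern copied from the rule above; the escaped hyphens are kept as written.
PSNAME_RE = re.compile(r"(.*)\-(UniJIS\-UTF8\-H)$")


def split_psname(psname):
    """Return (family, cmap) if psname ends in the CMap suffix, else None."""
    m = PSNAME_RE.match(psname)
    if m is None:
        return None
    # group(1) corresponds to \1 (family), group(2) to \2 (pscmap).
    return m.group(1), m.group(2)


# "Ryumin-Light" is only a sample name, not taken from this report.
print(split_psname("Ryumin-Light-UniJIS-UTF8-H"))
print(split_psname("DejaVuSans"))
```

Because `(.*)` is greedy and the suffix is anchored with `$`, any hyphens inside the family part stay in the first group.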
Comment 7 Behdad Esfahbod 2013-01-14 04:44:36 UTC
(In reply to comment #6)
> Another idea for a regexp use case is:
> 
> <match>
>   <test name="psname" compare="regex">
>     <string>(.*)\-(UniJIS\-UTF8\-H)$</string>
>   </test>
>   <edit name="family" mode="regex_replace">
>     <string>\1</string>
>   </edit>
>   <edit name="pscmap" mode="regex_replace">
>     <string>\2</string>
>   </edit>
>   <edit name="lang" mode="assign">
>     <langset><string>ja</string></langset>
>   </edit>
> </match>
> 
> We could have a bunch of rules for CMaps that determine the family name
> and the lang from the psname in the pattern.

This is much harder to implement in the current codebase.

One thing I don't like is matching for things like "pa.*" as that would also match things like "par".  But I guess that can be fixed by a more involved regexp.

I'm not opposing this per se, just pointing out details that need to be taken into consideration.
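
As a rough illustration of the "pa.*" concern, the following Python sketch compares the loose pattern with one possible stricter form. It assumes the regexp is matched against the whole lang string (the report does not say how the patch anchors it), and `pa(-.*)?` is just one candidate for the "more involved regexp", not something proposed in this thread.

```python
import re

# Loose pattern from the original description.
LOOSE = re.compile(r"pa.*")
# Hypothetical stricter pattern: "pa" alone, or "pa" followed by "-" and a subtag.
STRICT = re.compile(r"pa(-.*)?")


def matches(regex, lang):
    """True if the regex matches the whole lang string (an assumption)."""
    return regex.fullmatch(lang) is not None


for lang in ("pa", "pa-pk", "par"):
    print(lang, matches(LOOSE, lang), matches(STRICT, lang))
```

Under that assumption the loose pattern accepts all three strings, while the stricter one rejects "par".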
Comment 8 Akira TAGOH 2013-01-15 02:27:50 UTC
(In reply to comment #7)
> One thing I don't like is matching for things like "pa.*" as that would also
> match things like "par".  But I guess that can be fixed by a more involved
> regexp.

Indeed. It sounds to me like we are trying a bit too hard on the lang comparison as well. In that case I feel we shouldn't allow <test name="lang"><string>xx</string></test>. Using <langset><string>xx</string></langset> instead would make it easier to see that this isn't something that can be done with a string operation. Then we can make this feature a string-specific operation.

FWIW my main interest in this feature is now comment#6, not the original one. It is more important for implementing PostScript-related features in bz.
Comment 9 Akira TAGOH 2013-01-17 11:48:00 UTC
Doh, s/in bz/in fontconfig/

How bad would it be to limit this feature to strings? We could perhaps have special behavior for lang or charset here, but that looks like inconsistent behavior, and I'm a bit concerned that people would get confused by it.

Another idea: since some comparison modes fall back to another mode depending on the value type, this one could fall back to eq or similar, perhaps with a warning, for lang, charset, and other types that aren't supposed to be strings.
Comment 10 Akira TAGOH 2013-01-17 12:11:12 UTC
Another use case that I'm keen to see is:

<match>
  <edit name="family_copy" mode="assign">
    <name>family</name>
  </edit>
</match>
<match>
  <test name="family_copy" compare="regex">
    <string>[[:space:]]</string>
  </test>
  <edit name="family_copy" mode="regex_replace_all">
    <string>-</string>
  </edit>
</match>
<match>
  <test name="psname" compare="eq" qual="all">
    <string></string>
  </test>
  <edit name="psname" mode="assign">
    <name>family_copy</name>
  </edit>
</match>

With the above rules I'm expecting to replace all whitespace with '-' and set the result as 'psname' in the pattern if it is not available.
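
The expected effect of these three rules can be sketched in Python as follows. This is only a model of the intent: a pattern is represented as a plain dict, `fill_psname` is a hypothetical helper, the pattern is assumed to always carry a "family" entry, and `\s` stands in for the POSIX class `[[:space:]]`, which Python's `re` module does not support.

```python
import re


def fill_psname(pattern):
    """Return a copy of pattern with 'psname' derived from 'family' if unset."""
    result = dict(pattern)
    if not result.get("psname"):
        # Replace every whitespace character with '-' (the regex_replace_all step).
        result["psname"] = re.sub(r"\s", "-", result["family"])
    return result


print(fill_psname({"family": "DejaVu Sans Mono"}))
print(fill_psname({"family": "DejaVu Sans", "psname": "DejaVuSans"}))
```

An existing psname is left untouched, which corresponds to the third rule only firing when psname is empty.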
Comment 11 Behdad Esfahbod 2013-02-06 21:29:48 UTC
How about defining the regexps Perl-style, ie:

<string>s/( *)/-/g</string>

?
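
One detail to watch with a substitution of this shape: ` *` can also match the empty string, so a global replace may insert '-' in places that contain no space at all, whereas ` +` only matches actual runs of spaces. The sketch below shows the difference using Python's `re.sub` as a stand-in; Perl's handling of empty matches may differ in details, and the sample family name is an assumption.

```python
import re

family = "DejaVu Sans Mono"

# " *" matches zero or more spaces, so it also matches at empty positions.
star = re.sub(r" *", "-", family)
# " +" requires at least one space, so only real gaps are replaced.
plus = re.sub(r" +", "-", family)

print(star)
print(plus)
```

If the goal is the whitespace-to-hyphen conversion from comment #10, the ` +` form (or a single-space match) is probably closer to what is wanted.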
Comment 12 Akira TAGOH 2013-02-07 05:29:03 UTC
(In reply to comment #11)
> How about defining the regexps Perl-style, ie:
> 
> <string>s/( *)/-/g</string>
> 
> ?

Hmm, yeah, that would be an easier way to implement this feature, but it doesn't quite make sense syntax-wise, because it does the matching and the editing in one line. Or shall we add <regex> and allow that only in this block, instead of <test compare="regex">?
Comment 13 Behdad Esfahbod 2013-02-07 05:38:01 UTC
One way or another, it's much easier than trying to match "\1" to what a <match> block matched.  Or maybe you have better ideas?
Comment 14 Akira TAGOH 2013-02-07 06:02:30 UTC
Well, that said, the advantage of my approach would be that it can be applied to different objects at the same time in <edit> blocks, as I wrote in comment#6. IIUC, with your suggestion it can be applied only to the object specified in <test>, and similar <string/> lines are needed for other objects, right?

It may be somewhat hard to understand how it works without any examples, but IMHO regexps themselves are like that, so the situation can be improved if we have more examples for each use case.
Comment 15 GitLab Migration User 2018-08-20 21:43:32 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/fontconfig/fontconfig/issues/7.
