Bug 103237 - FcCacheOffsetsValid() couldn't detect broken cache file
Summary: FcCacheOffsetsValid() couldn't detect broken cache file
Status: RESOLVED FIXED
Alias: None
Product: fontconfig
Classification: Unclassified
Component: library (show other bugs)
Version: 2.12
Hardware: ARM Linux (All)
: medium critical
Assignee: Akira TAGOH
QA Contact: Behdad Esfahbod
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-10-12 09:13 UTC by Youngbok Shin
Modified: 2017-11-14 11:57 UTC (History)
3 users (show)

See Also:
i915 platform:
i915 features:


Attachments
The broken cache file. (96.29 KB, application/octet-stream)
2017-10-12 09:13 UTC, Youngbok Shin
Details
another validation (595 bytes, patch)
2017-10-12 10:50 UTC, Akira TAGOH
Details | Splinter Review
Normal (Proper) cache file. (96.29 KB, application/octet-stream)
2017-10-12 11:47 UTC, Youngbok Shin
Details

Description Youngbok Shin 2017-10-12 09:13:44 UTC
Created attachment 134807 [details]
The broken cache file.

I faced an issue caused by broken cache files for fontconfig.
I don't know how fontconfig fails to generate proper caches.
But, somehow, fontconfig generated the broken cache which I attached in this bug report.
As I know, the broken cache file should be detected by FcCache*Valid functions and it should be regenerated.
But, the attached cache file was not caught by the *Valid functions. It caused crash and missing font rendering issues in our platform (Tizen).
The bug raised from fontconfig 2.12.1. I found a patch for FcCacheOffsetsValid() just after releasing 2.12.1.
But, after applying the patch, fontconfig still couldn't detect it properly.

 

As our investigation, font->num was 0 and font->size was 0 in FcCacheOffsetsValid() function with the issue case.
Is this acceptable scenario? Or can I make it return with FcFalse in that condition?
You can test the attached cache file with fc-cat. :)
(You need to change FcCacheTimeValid() function to always return FcTrue
if you want to test this cache file in random environments.)
Comment 1 Akira TAGOH 2017-10-12 10:37:46 UTC
Hm, okay. I can reproduce the crash when bypassing the check of FcCacheTimeValid. though this cache is totally bogus. how did you generate it? fc-cache does create the same one once you remove it? then do you have any patches applied to your fontconfig?

Anyway, FcCacheOffsetsValid will need more checking to detect such bogus caches right. will work on it.
Comment 2 Akira TAGOH 2017-10-12 10:50:12 UTC
Created attachment 134810 [details] [review]
another validation

The attached patch will validate this bogus cache though, I don't think this is really a correct fix for this. we should investigate why/how this was created and fix it.
Comment 3 Youngbok Shin 2017-10-12 11:47:18 UTC
Created attachment 134812 [details]
Normal (Proper) cache file.

I agree with your opinion.
We need to figure out how/why this broken cache was generated.
It should be fixed. And it will be great solution.
But, unfortunately, we couldn't figure out how it was generated.
This broken cache file was found from release binary of our product.
But, it was not created by any processes and fc-cache from engineer binary.

----------------
I will attach proper cache file. You can see this file is longer than the broken one.
The broken one looks like a cache writer is dead while it is being written.

We assumed that it could be happened if device is turned off (power off) while fontconfig is writing the cache. But, not proved.

----------------
Anyway, I'll ask product engineer to test product after applying your patch.
Thank you!
Comment 4 Akira TAGOH 2017-10-12 11:55:41 UTC
cache will be written atomically so even if something happened during writing the cache data into the file, it won't be overwritten by the broken one.
Comment 5 Youngbok Shin 2017-10-12 12:10:58 UTC
I mean what if very first cache writer is dead/stopped during writing cache file?

Anyway, we need to reproduce to create this broken one to verify what is happened. We couldn't reproduce creating the broken one above 2 month. :(
I only have a coredump from the product which had issue.
The cache file and crash callstack were extracted from the coredump.

If I can reproduce it again, I will raise another bug report or write a comment here.
Comment 6 Akira TAGOH 2017-10-13 06:10:35 UTC
(In reply to Youngbok Shin from comment #5)
> I mean what if very first cache writer is dead/stopped during writing cache
> file?

See FcDirCacheWrite. given that this was generated by fontconfig, this breakage happens only when the memory is corrupted.
Comment 7 Youngbok Shin 2017-11-14 10:40:45 UTC
The memory corruption could be one of possible reason. If the memory corruption is the main reason of this, I suppose that we could reproduce it easily or find other serious issues. But, I couldn’t still find memory corruption issues about this. 

I found a write() call in FcDirCacheWrite() is used without calling fsync().
I supposed fsync() is called for better performance. The open-write-close code without fsync() or O_SYNC flag does not make any guarantee that data has been committed to disk. It can cause corrupted cache file when the device is shutdown by power outage. How about your opinion about open-write-close without fsync()?
Comment 8 Akira TAGOH 2017-11-14 11:03:52 UTC
Hmm, does it guarantee the file size?

Anyway, see the difference between the valid and the invalid one. you can see the difference at the earlier place and interestingly the invalid cache has non-zero value. this would means the situation isn't what you expect.
Comment 9 Akira TAGOH 2017-11-14 11:38:28 UTC
For your concern, maybe we can't have any solution for that because it isn't what we can control. we should rather do check carefully and simply drop the invalid cache. in that sense, I may want to merge https://bugs.freedesktop.org/attachment.cgi?id=134810 this proposed change.
Comment 10 Youngbok Shin 2017-11-14 11:39:47 UTC
> Hmm, does it guarantee the file size?
Do you mean fsync() guarantees synchronization without considering its file size?
Then, as I know, yes. It writes a whole buffer to the file descriptor at once. I'll be bad for performance.

I think even if we can see non-zero values at head of invalid cache file, we still need to consider the file corruption by power outage. If a file system couldn't create a whole file or write anything on a file by power outage, we wouldn't call it "corruption".
Here is a Q&A about these scenarios.
https://serverfault.com/questions/403891/journaled-filesystems-and-power-failure


If we can't allow bad performance by calling fsync() for writing the big cache files (to prevent very very rare file corruption issues), I think we need to add more validation code. Just like your temporary patch for the issue.
Comment 11 Youngbok Shin 2017-11-14 11:42:23 UTC
I agree with you. Only we can do for my concern is adding more validation code. Thank you!
Comment 12 Akira TAGOH 2017-11-14 11:57:36 UTC
Okay, merged into master. let's see.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.