Bug 2878 - CGPDFDictionary.GetString() returns incorrect result if there are (German) Umlauts in the string
Status: RESOLVED NORESPONSE
Alias: None
Product: iOS
Classification: Xamarin
Component: XI runtime ()
Version: 5.0
Hardware: Macintosh Mac OS
Importance: --- normal
Target Milestone: Untriaged
Assignee: Sebastien Pouliot
Reported: 2012-01-13 07:33 UTC by René Ruppert
Modified: 2013-12-05 18:33 UTC (History)

Attachments
PDF with Umlauts in the bookmarks (780.01 KB, application/pdf)
2012-01-13 08:20 UTC, René Ruppert

Description René Ruppert 2012-01-13 07:33:57 UTC
Reading the table of contents from a PDF returns incorrect results if the entries contain ö, ä or ü.
The result is a character value of 65533 (U+FFFD, the Unicode replacement character) for all of these.
It looks like PDF strings are not always encoded as UTF-8.
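A minimal standalone sketch of how that value can arise, assuming the PDF stores "ä" as the single byte 0xE4 (the value reported in Comment 2). It uses plain .NET `System.Text` rather than the MonoTouch/NSString code path, and the names `ReplacementCharDemo` and `DecodeAsUtf8` are hypothetical, chosen only for this illustration:

```csharp
using System;
using System.Text;

static class ReplacementCharDemo
{
    // Decodes the given bytes as UTF-8 and returns the code unit of the
    // first resulting character.
    public static int DecodeAsUtf8(byte[] bytes)
    {
        string s = Encoding.UTF8.GetString(bytes);
        return s[0];
    }

    static void Main()
    {
        // 0xE4 is "ä" in Latin-1, but in UTF-8 it is the lead byte of a
        // three-byte sequence. With no continuation bytes following, the
        // decoder substitutes U+FFFD.
        Console.WriteLine(DecodeAsUtf8(new byte[] { 0xE4 })); // prints 65533
    }
}
```

The byte itself is intact; only the choice of decoder turns it into the replacement character.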
Comment 1 René Ruppert 2012-01-13 07:38:52 UTC
I found some PDF code on the web (http://opensource.apple.com/source/WebKit/WebKit-7533.16/mac/WebView/WebPDFDocumentExtras.mm), and it seems that the encoding can also be Unicode (UTF-16), signalled by a leading 0xFE 0xFF byte-order mark:

NSStringEncoding encoding = (length > 1 && bytes[0] == 0xFE && bytes[1] == 0xFF) ? NSUnicodeStringEncoding : NSUTF8StringEncoding;
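
The same byte-order-mark check can be sketched in standalone C#. This is an illustration only, not the MonoTouch implementation. It differs from the WebKit line above in one deliberate way: when no FE FF mark is present it maps each byte directly to the code point of the same value (Latin-1) instead of decoding as UTF-8. That is an assumption made here because PDFDocEncoding agrees with Latin-1 for characters such as ä, ö and ü; it is not exact for every byte value. The names `PdfTextStringSketch` and `Decode` are hypothetical:

```csharp
using System;
using System.Text;

static class PdfTextStringSketch
{
    // Decodes the raw bytes of a PDF text string.
    public static string Decode(byte[] bytes)
    {
        // A leading FE FF marks big-endian UTF-16; skip the two BOM bytes.
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
            return Encoding.BigEndianUnicode.GetString(bytes, 2, bytes.Length - 2);

        // Assumption: treat everything else as Latin-1, where each byte
        // maps to the Unicode code point of the same value. This only
        // approximates PDFDocEncoding.
        var sb = new StringBuilder(bytes.Length);
        foreach (byte b in bytes)
            sb.Append((char)b);
        return sb.ToString();
    }

    static void Main()
    {
        // "äöü" as single bytes, then "ä" as BOM-prefixed UTF-16BE.
        Console.WriteLine(Decode(new byte[] { 0xE4, 0xF6, 0xFC }));
        Console.WriteLine(Decode(new byte[] { 0xFE, 0xFF, 0x00, 0xE4 }));
    }
}
```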
Comment 2 René Ruppert 2012-01-13 07:55:08 UTC
I remember that you use an unsafe {} block to convert a CGPDFStringRef into a string. I found some code that does it without unsafe, but it suffers from the same problem with the umlauts. However, it shows that the umlauts are still correct in the variable "data"; they only get lost when data is converted into an NSString. In data, "ä" for instance still has the value 228 (0xE4).
Maybe this code helps to find the issue.

public static bool PDFDictionaryGetString (IntPtr handle, string key, out string result)
{
	IntPtr stringPtr;
	result = null;

	if (CGPDFDictionaryGetString (handle, key, out stringPtr))
	{
		if (stringPtr == IntPtr.Zero)
			return false;

		// Get the length of the PDF string in bytes
		uint n = (uint) CGPDFStringGetLength (stringPtr);

		// Get a pointer to the raw bytes of the string
		var ptr = CGPDFStringGetBytePtr (stringPtr);
		// Wrap the bytes in NSData; the umlauts are still correct here
		var data = NSData.FromBytes (ptr, n);
		// Decode as UTF-8; this is where the umlauts get lost
		var value = NSString.FromData (data, NSStringEncoding.UTF8);
		// Decode as Unicode for comparison only; value2 is not used below
		var value2 = NSString.FromData (data, NSStringEncoding.Unicode);

		result = value.ToString ();
		return true;
	}
	return false;
}
Comment 3 Sebastien Pouliot 2012-01-13 08:02:39 UTC
Good catch!
Comment 4 René Ruppert 2012-01-13 08:04:50 UTC
And more info. If (in the code above) I use "NSStringEncoding.ASCIIStringEncoding" instead of UTF8, I get the correct German Umlauts.
So: resolved if you change that? Or will it break other situations?
Comment 5 Sebastien Pouliot 2012-01-13 08:16:12 UTC
It will definitely break things; ASCII is too limited. I'll check whether CGPDFStringCopyTextString is smarter about encodings or whether we need to add some encoding detection inside MonoTouch.

Can you provide me with your PDF that contains the German umlauts (and other such characters)?
Comment 6 René Ruppert 2012-01-13 08:20:22 UTC
Created attachment 1181 [details]
PDF with Umlauts in the bookmarks
Comment 7 René Ruppert 2012-01-13 08:20:52 UTC
I attached a PDF with 3 bookmarks. The last bookmark is "äöü" (use Adobe Reader)
Comment 8 Sebastien Pouliot 2012-01-13 09:30:09 UTC
Using CGPDFStringCopyTextString seems to work. I'll check whether it regresses other cases (e.g. #975).

I assume you are on 5.1.1 (from #991)? If so, once my tests are done I can provide you with an updated monotouch.dll that has this fix (and, for sure, the same fix for CGPDFArray.GetString) for further testing.
Comment 9 Sebastien Pouliot 2012-01-13 11:11:23 UTC
Fixed in d94dc8a704fe55e46f120694e6f1a70e9d7e1eff (maccore master).
https://github.com/mono/maccore/commit/d94dc8a704fe55e46f120694e6f1a70e9d7e1eff

The PDF attachment works when using this:

		void Test ()
		{
			var doc = CGPDFDocument.FromFile ("/Users/poupou/Downloads/Document Collection-1.pdf");
			var cat = doc.GetCatalog ();
			cat.Apply (Process);
		}

		void Process (string name, object obj)
		{
			Console.WriteLine ("{0} : {1}", name, obj);
			var dict = obj as CGPDFDictionary;
			if (dict != null) {
				switch (name) {
				case "Parent":
				case "Trans":
				case "Next":
					break;
				default:
					dict.Apply (Process);
					break;
				}
			}
		}

and #975 still works with the change.

Let me know if you want an updated assembly to run your own tests (otherwise the fix will be available in 5.1.3).
Comment 10 René Ruppert 2012-01-13 15:18:01 UTC
I'll add an external to my code and use it to show the table of contents. I will then wait for MT 5.1.3.
Comment 11 René Ruppert 2012-01-16 04:05:54 UTC
Hi Sebastien, I tried adding the external reference, but unfortunately you are using CFString(), which is inaccessible because of its protection level. Can you attach a MonoTouch binary? I will have to use that binary for the distribution version too. Is that okay?
Comment 12 René Ruppert 2012-01-16 04:09:16 UTC
I am referring to this constructor (which, by the way, I think is incorrect: if "owns" is false, the string gets retained. Shouldn't it be the other way round?):

internal CFString (IntPtr handle, bool owns)
		{
			this.handle = handle;
			if (!owns)
			{
				CFObject.CFRetain (handle);
			}
		}
Comment 13 René Ruppert 2012-01-16 04:21:27 UTC
And another unknown: If I use your version and just call the constructor that sets "owns" false, I don't get the Umlauts correctly. It's working with ASCII encoding only.
Comment 14 Sebastien Pouliot 2012-01-16 08:07:00 UTC
Please attach a test case with the code you are using and tell me what (default) language you're using on your Mac (some APIs behave differently).
Comment 15 Sebastien Pouliot 2012-01-16 09:47:04 UTC
Note: regarding comment #12, the use of the .ctor is correct with respect to the API being used. If you want a longer explanation then just ask on stackoverflow.com ;-) as it's not related to this bug and the answer could prove useful to other people as well.
Comment 16 PJ 2013-11-19 17:03:57 UTC
This bug has been in the NEEDINFO state with no changes for the last 90 days. Can we put this back into the NEW or CONFIRMED state, or are we still awaiting response?

If there is no change in the status of this bug over the next two weeks, this bug will be marked as NORESPONSE.
Comment 17 PJ 2013-12-05 18:33:58 UTC
This bug has not been changed from the NEEDINFO state since my previous comment, marking as RESOLVED NORESPONSE.

Please feel free to REOPEN this bug at any time if you are still experiencing the issue. Please add the requested information and set the bug back to the NEW (or CONFIRMED) state.