[Subtitles] OCR translates ? by 7, most of the time

Started by sinbad21, August 24, 2016, 20:16:48

sinbad21

Hi,

I'm very satisfied with version 2 of TS-Doctor. The OCR is very good at recognizing French dialogue in subtitles, except for one small bug that I have to correct manually: the ? sign is read as the number 7 most of the time (on the French channels).

I don't know if this bug occurs in other countries.

Djfe

If you could provide a sample recording for Cypheros (it doesn't have to be more than a few minutes, it just needs to contain the question mark ;) ), then he might be able to improve the OCR a bit so that it works better with French channels in the future :)

Mam

Quote from: Djfe on August 28, 2016, 15:50:57
If you could provide a sample recording for Cypheros (it doesn't have to be more than a few minutes, it just needs to contain the question mark ;) ), then he might be able to improve the OCR a bit so that it works better with French channels in the future :)

No, totally bad idea!
He can do it himself by downloading and installing the complete engine, training it for his French needs and copying the results over to TS-Doctor (and sharing them with Cypheros too, so he can include them in the next version). I think it's practically impossible for Cypheros to spend the time on this and to decide what is right and what is wrong (besides "?" and "7", French really is not an easy or logical language if you compare spoken text to written and vice versa).
And then the next person will come along and ask for Sanskrit or Farsi.
That's far beyond what he can support.

So, look for "tesseract"; it's an open-source OCR engine with a small GUI. TS-Doctor ships only the engine, but for training you need to install the full version.
Run it and feed some of your problematic files through it; you can manually correct your ? there, and it saves the tuning files for you automatically. From then on it will handle your ? correctly. Then you can copy the tuning file over to TS-Doctor's folder, and it will pick up your changes too.

sinbad21

Quote from: Djfe on August 28, 2016, 15:50:57
If you could provide a sample recording for Cypheros (it doesn't have to be more than a few minutes, it just needs to contain the question mark ;) ), then he might be able to improve the OCR a bit so that it works better with French channels in the future :)

Thank you for your response. Look at this example: in entries 4 and 5 there is a "7" in place of "?". In the entire file, the same error occurs 157 times:

1
00:01:15,839 --> 00:01:17,199
-Monsieur, bonjour.

2
00:01:17,199 --> 00:01:20,199
-Bonjour. Il doit y avoir
une réservation au nom de M. Swann.

3
00:01:22,159 --> 00:01:25,159
-En effet, monsieur.

4
00:01:26,679 --> 00:01:29,679
-Je peux vous payer en liquide 7
-Bien sûr, monsieur Swann.

5
00:01:35,199 --> 00:01:38,199
Vous êtes à Paris pour affaires 7
-Non. Pour dormir.

6
00:01:40,599 --> 00:01:43,599
Merci.
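
Until the OCR itself gets this right, the manual correction described above could be scripted. The following is only a rough sketch, not something TS-Doctor provides: the function name `fix_question_marks` is made up for illustration, and it assumes that a misread "?" always appears as a lone "7" at the very end of a dialogue line, preceded by a space (as in entries 4 and 5). A line that genuinely ends in the number 7 would be changed too, so the result should still be checked by eye.

```python
import re

# Assumption: a misread "?" shows up as a lone "7" at the end of a
# dialogue line, after a space (French puts a space before "?").
TRAILING_SEVEN = re.compile(r"(?<=\S) 7$")

# SRT timing lines look like "00:01:26,679 --> 00:01:29,679".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def fix_question_marks(srt_text):
    """Return (fixed_text, number_of_replacements) for SRT content."""
    fixed_lines = []
    count = 0
    for line in srt_text.splitlines():
        stripped = line.rstrip()
        # Leave entry numbers and timing lines alone; only touch dialogue.
        if stripped.isdigit() or TIMING.match(stripped):
            fixed_lines.append(line)
            continue
        new_line, n = TRAILING_SEVEN.subn(" ?", stripped)
        count += n
        fixed_lines.append(new_line)
    return "\n".join(fixed_lines), count


if __name__ == "__main__":
    sample = (
        "4\n"
        "00:01:26,679 --> 00:01:29,679\n"
        "-Je peux vous payer en liquide 7\n"
        "-Bien sûr, monsieur Swann.\n"
    )
    fixed, replaced = fix_question_marks(sample)
    print(replaced, "replacement(s)")
    print(fixed)
```

The returned count can be compared against the number of errors found by hand (157 in this file) as a quick sanity check.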

Cypheros

Did you check the new beta with the better OCR engine?

sinbad21

Quote from: Cypheros on June 04, 2017, 11:42:40
Did you check the new beta with the better OCR engine?

Good news, the problem is solved with the beta ;) But the OCR is much slower, isn't it? About 12x slower. I have a warning in the log: "Quartz.dll warning: incomplete Wine implementation". Maybe it comes from my Wine installation.

Cypheros

Yes, high-precision OCR is slower. Try the normal OCR; it is not as good, but still better than the old version. If you use a VM, try to assign more CPUs, as OCR is now multithreaded.

Yes, the quartz.dll warning is caused by Wine. Which Wine version do you use?

sinbad21

Quote from: Cypheros on June 04, 2017, 18:57:53
Yes, high-precision OCR is slower. Try the normal OCR; it is not as good, but still better than the old version. If you use a VM, try to assign more CPUs, as OCR is now multithreaded.

Yes, the quartz.dll warning is caused by Wine. Which Wine version do you use?

Wine version: 1.8-rc4 (for Mac). High-precision is not checked by default, and I did not change it. I also tried in VMware, and it's fast. The slowness only occurs in Wine.

Cypheros

How many CPU cores do you have?

For me, OCR is more than 4 times faster than before with a 6-core Intel i7-5820K.

sinbad21

Quote from: Cypheros on June 05, 2017, 00:15:16
How many CPU cores do you have?

For me, OCR is more than 4 times faster than before with a 6-core Intel i7-5820K.

I have an i7 with 4 cores, but as I said before, the slowness only occurs when I launch TS-Doctor in Wine (emulating some kind of Windows XP, I expect). I have a Boot Camp partition with Windows 10, and when I boot into Windows, yes, it is fast. If I run Windows via VMware, it is also fast. But with Wine, the OCR step is very, very slow (only the OCR step). And with the release version of TS-Doctor (2.0.71) it is not slow, even in Wine.

Cypheros

Try disabling multithreading for OCR. Maybe the new multithreading is not working well under Wine.

sinbad21

Quote from: Cypheros on June 05, 2017, 15:22:09
Try disabling multithreading for OCR. Maybe the new multithreading is not working well under Wine.

That is the solution! It works perfectly now. Thank you :)


www.cypheros.de