Error decode UTF-8 character 'â' #18

Open
Trantamhvbc opened this issue Oct 11, 2023 · 11 comments

Comments

@Trantamhvbc

Trantamhvbc commented Oct 11, 2023

I have a problem when I use pyzbar to decode a QR image: the decoded result does not match the data I encoded with qrcode beforehand.
This is my code:

import cv2
import qrcode
from PIL import Image
from qreader import QReader

image_path = "my_image.png"
data = 'â'
print(f'data = {data}')

# Generate the QR code and save it to disk
img = qrcode.make(data)
img.save(image_path)

# Read the image back with OpenCV and decode it
img = cv2.imread(image_path)
result = qreader.detect_and_decode(image=img)
print(f"result = {result[0]}")

@Eric-Canas
Owner

Eric-Canas commented Oct 11, 2023

Hi, how are you initializing qreader here? qreader.detect_and_decode(image=img)

I ran your code, instantiating it simply as qreader = QReader(), and it gives me the correct result.

data = â
result = â

Exploring with the debugger, I found that pyzbar intermediately decodes an incorrect character ('ﾃ｢') with utf-8. (screenshot)
However, when you instantiate QReader with its default reencode_to value, it fixes this automatically. (screenshots)

I think that it should only fail to decode that character if you initialize it as QReader(reencode_to='utf-8') or QReader(reencode_to=None).

If that's not the case, could you give me more information to try to replicate the error?

  • Are you running the latest version?
  • Which OS are you running?

@Trantamhvbc
Author

Trantamhvbc commented Oct 12, 2023

Hi, @Eric-Canas I am using

  • OS Ubuntu 22.04.1 LTS.
  • qreader 3.11
  • python 3.10.12

This is my result: (screenshot)

@Eric-Canas
Owner

I have been trying to replicate the error in Windows, Amazon Linux and Ubuntu 22.04, and I have not been able to reproduce it :(

The error should be replicable by running:

>>> 'ﾃ｢'.encode('shift-jis').decode('utf-8')
'â'

Does this code also break for you?
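
The same check can be written with explicit code points, which avoids ambiguity if the katakana characters get normalized when copied from a web page. This is only a minimal sketch. It assumes the intermediate string is made of the half-width characters U+FF83 and U+FF62, which is inferred from the UTF-8 bytes of 'â' rather than confirmed in this thread:

```python
# UTF-8 bytes of the original character 'â' (U+00E2)
original = "\u00e2"
utf8_bytes = original.encode("utf-8")

# Assumption: the bytes are misread as Shift-JIS, where every byte in the
# range 0xA1-0xDF maps to one half-width katakana or punctuation character.
misread = utf8_bytes.decode("shift_jis")

# Re-encoding with the same charset restores the raw bytes, and decoding
# those bytes as UTF-8 recovers the original text.
recovered = misread.encode("shift_jis").decode("utf-8")

print([hex(ord(c)) for c in misread])
print(recovered == original)
```

If the last line prints False or an exception is raised on a given machine, that would point to a difference in the installed codecs rather than in the QR data.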

(Screenshots: Amazon Linux 2023, Ubuntu)

My best guess is that it is related to the regional configuration of the OS, but I cannot confirm that, as I have not been able to replicate the error :(

The problem is related to how Python encodes and decodes plain strings with special characters, as this is the line that is giving you the warning:

'ﾃ｢'.encode('shift-jis').decode('utf-8')

@Trantamhvbc
Author

Trantamhvbc commented Oct 13, 2023

I tried my code in Google Colab and got the same result as on my computer. (screenshot)

I also checked the pyzbar result: my program gets b'\x8e\xa3', which differs from your result (b'\xc3\xa2'). (screenshot)

@Eric-Canas
Owner

Hi!

Sorry for the inconvenience, I oversimplified the error. I have been researching it thanks to your Google Colab, and I found that the problem is that Windows and Linux do not use the same decoding. So, while the default "utf-8" pyzbar decoding was 'ﾃ｢' on Windows, it was '璽' on Linux.

I ran extensive experiments comparing shift-jis with other encodings, and "Big5" is the one that gave me the correct decoding results for all characters on Linux systems, as shift-jis did on Windows systems (it gives the same decoding as shift-jis in all cases where shift-jis works, and correct results in the cases where shift-jis fails on Linux).

I have uploaded an update that selects one or the other encoding as the default, depending on your OS ("Big5" fails on a lot of characters on Windows :( ). I have tested it on your Google Colab, and it is producing the expected results now.

You can upgrade with pip install --upgrade qreader. The previous version should still work if you instantiate QReader as QReader(reencode_to="big5").
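
For anyone pinned to an older version, the OS-dependent default described above could be approximated by hand. This is a rough sketch, not the library's actual code: default_reencode_charset is a hypothetical helper, and the Windows/other split simply mirrors what the comment above describes.

```python
import platform
from typing import Optional


def default_reencode_charset(system: Optional[str] = None) -> str:
    """Hypothetical helper: choose a re-encoding charset based on the OS.

    Mirrors the rule described in the thread: shift-jis on Windows,
    big5 everywhere else. `system` can be passed explicitly for testing.
    """
    system = system or platform.system()
    return "shift-jis" if system == "Windows" else "big5"


print(default_reencode_charset())
```

The returned name could then be passed as QReader(reencode_to=default_reencode_charset()), using the reencode_to parameter mentioned earlier in this thread.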

Thanks a lot for your warning!

@Trantamhvbc
Author

Hi @Eric-Canas, I have checked your solution, and it gives correct decoding results on my computer.
Thanks for your support.

@Trantamhvbc
Author

Trantamhvbc commented Oct 17, 2023

Hi @Eric-Canas ,

I checked QReader(reencode_to="big5") with the character 'â' and it gave the correct result. But when I checked a large data set with QReader(reencode_to="big5"), I got many errors of the same kind.
Here are my code and data:

import json

import cv2
import qrcode
from qreader import QReader

image_path = "my_image.png"

qreader = QReader(model_size='n', reencode_to='big5')
with open('uit_member.json', 'r', encoding='utf-8') as json_file:
    data = json.load(json_file)
j = 0      # number of mismatched entries
len_ = 0   # number of entries processed so far

for i in data:
    len_ += 1
    name = i["full_name"]
    # Encode the name as a QR image, then read it back from disk
    img = qrcode.make(name)
    img.save(image_path)
    img = cv2.imread(image_path)
    result = qreader.detect_and_decode(image=img)
    # Report every entry whose decoded text differs from the original
    if name != result[0]:
        j += 1
        print(f"{j*100/len_}% data {name} result = {result[0]} ")

@Eric-Canas
Owner

Hi!

Thanks for your test data. I'm still testing; it seems that some entries are quite difficult to decode. For the moment, I can tell you that most of your errors should disappear this way:

QReader(reencode_to=('big5', 'shift-jis', 'latin1'))

But not all of them.
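
To illustrate the idea of trying several charsets in order, here is a minimal sketch. It is not how QReader implements reencode_to internally; reencode_with_fallback is a hypothetical helper that only shows the general technique of re-encoding a mis-decoded string and checking whether the bytes form valid UTF-8.

```python
from typing import Iterable


def reencode_with_fallback(text: str, charsets: Iterable[str]) -> str:
    """Hypothetical helper: try each charset in order.

    For every charset, re-encode the mis-decoded text back to bytes and
    attempt to read those bytes as UTF-8. The first charset that succeeds
    wins; if none does, the input is returned unchanged.
    """
    for charset in charsets:
        try:
            return text.encode(charset).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return text


# 'Ã¢' is what the UTF-8 bytes of 'â' look like when read as latin1
print(reencode_with_fallback("\u00c3\u00a2", ("latin1",)))
```

One limitation of this approach is that the first charset producing valid UTF-8 is accepted even when the result is not the intended text, which may be one reason some entries still fail.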

To easily replicate the error, there should be a way to decode
b'L\xef\xbe\x83\xef\xbd\xaa Anh S\xef\xbe\x86\xef\xbd\xa1n'
as
Lê Anh Sơn

But I can't find any charset that works. These are the raw bytes that pyzbar returns for the QR code that qrcode generates for this entry, and I can't find any single or double re-encoding that decodes them correctly.
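
A systematic way to search for a working charset is to test a list of candidates against the quoted bytes. The sketch below is only one possible approach: find_roundtrip_charsets is a hypothetical helper, the candidate list is illustrative, and the outcome may differ for the bytes pyzbar produces on another OS.

```python
from typing import Iterable, List


def find_roundtrip_charsets(raw: bytes, expected: str,
                            candidates: Iterable[str]) -> List[str]:
    """Hypothetical helper: list charsets that recover `expected`.

    `raw` is decoded as UTF-8 first (as pyzbar output would be), then
    re-encoded with each candidate and decoded as UTF-8 again.
    """
    text = raw.decode("utf-8")
    working = []
    for charset in candidates:
        try:
            if text.encode(charset).decode("utf-8") == expected:
                working.append(charset)
        except (UnicodeEncodeError, UnicodeDecodeError, LookupError):
            # This charset cannot represent the text, or is unknown
            continue
    return working


raw = b"L\xef\xbe\x83\xef\xbd\xaa Anh S\xef\xbe\x86\xef\xbd\xa1n"
expected = "L\u00ea Anh S\u01a1n"  # 'Lê Anh Sơn', written with escapes
print(find_roundtrip_charsets(raw, expected, ["big5", "shift_jis", "latin1"]))
```

Running the same function over every entry of a test data set, with a longer candidate list, would show which entries have no single-step re-encoding at all.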

Sorry, I'll update you if I find an alternative.

@tranvannhat

tranvannhat commented Oct 24, 2023

Hi, I have the same issue.
The actual phrase in my QR code is:
Vĩnh Phong, Vĩnh Bảo, Hải Phòng
When using the library, I get:
V藺nh Phong, V藺nh B廕υ, H廕ξ Ph簷ng

@congdaoduy298

Hi, did anyone solve this problem or find an approach to handle this case? Thank you!

@quyet12308

Hello, I have the same issue when I scan the QR code on my ID card.
The correct text should be: HUỲNH HIẾU THUẬN
but the text I received was: Hu廙軟h Hi廕簑 Thu廕要
I have read your documentation and changed the reencode_to parameter, but it doesn't seem to work for me.
My language is Vietnamese.
This is my code:

import cv2
from qreader import QReader

def read_qr_code_2(image_path):
    # Create a QReader instance
    qreader = QReader(model_size = 's', min_confidence = 0.5, reencode_to = 'utf-8')

    # Get the image that contains the QR code
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)

    # Use the detect_and_decode function to get the decoded QR data
    decoded_text = qreader.detect_and_decode(image=image)
    print(decoded_text)


5 participants