Error decode UTF-8 character 'â' #18

Trantamhvbc · 2023-10-11T09:54:34Z

I have a problem when I use pyzbar to decode a QR image: the result I get does not match the data I encoded with qrcode beforehand.
This is my code:

from qreader import QReader
from PIL import Image
import qrcode
import cv2  # needed for cv2.imread below

image_path = "my_image.png"
data = 'â'
print(f'data = {data}')
img = qrcode.make(data)

img.save(image_path)
img = cv2.imread(image_path)
result = qreader.detect_and_decode(image=img)
print(f"result = {result[0]}")

Eric-Canas · 2023-10-11T17:33:27Z

Hi, how are you initializing qreader in this line? qreader.detect_and_decode(image=img)

I ran your code, instantiating it simply as qreader = QReader(), and it gives me the correct result.

data = â
result = â

Exploring with the debugger, I found that, as an intermediate step, pyzbar decodes an incorrect character ('ﾃ｢') with utf-8.

However, when you instantiate QReader with its default reencode_to value, it fixes this automatically.

I think it should only fail to decode that character if you initialize it as QReader(reencode_to='utf-8') or QReader(reencode_to=None).
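
For illustration, the re-encoding step can be sketched in plain Python as follows. This is a minimal sketch, not QReader's actual implementation: the helper name reencode is hypothetical, and it assumes the text returned by pyzbar is a str that was decoded with the wrong charset.

```python
from typing import Optional


def reencode(text: str, charset: Optional[str]) -> str:
    """Hypothetical helper: undo a wrong decoding by round-tripping through `charset`."""
    if charset is None:
        # No re-encoding requested: the text is returned untouched.
        return text
    # Recover the raw bytes via `charset`, then read them as UTF-8.
    return text.encode(charset).decode("utf-8")


garbled = "ﾃ｢"  # the two halfwidth characters reported above

print(reencode(garbled, "shift-jis"))         # recovers the original character
print(reencode(garbled, "utf-8") == garbled)  # a UTF-8 round trip changes nothing
print(reencode(garbled, None) == garbled)     # None leaves the text as-is
```

With 'utf-8' or None the garbled text passes through unchanged, which is consistent with the failure only being expected for those two settings.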

If that's not the case, could you give me more information so I can try to replicate the error?

Are you running the latest version?
Which OS are you running?

Trantamhvbc · 2023-10-12T00:54:10Z

Hi @Eric-Canas, I am using:

OS Ubuntu 22.04.1 LTS.
qreader 3.11
python 3.10.12

This is my result :

Eric-Canas · 2023-10-12T09:02:46Z

I have been trying to replicate the error on Windows, Amazon Linux and Ubuntu 22.04, and I have not been able to reproduce it :(

The error should be replicable by running:

>>> 'ﾃ｢'.encode('shift-jis').decode('utf-8')
'â'

Does this code also break for you?

(Amazon Linux 2023)

(Ubuntu)

My best guess is that it is related to the regional configuration of the OS, but I cannot confirm that, as I have not been able to replicate the error :(

The problem is related to how Python encodes and decodes plain strings with special characters, since this is the line that is giving you the warning:

'ﾃ｢'.encode('shift-jis').decode('utf-8')
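
To see where 'ﾃ｢' comes from, it can help to look at the bytes involved. The sketch below uses only built-in str and bytes methods; the variable names are illustrative.

```python
original = "\u00e2"  # 'â'

# UTF-8 stores 'â' as two bytes.
raw = original.encode("utf-8")
print(raw)  # b'\xc3\xa2'

# Reading those same two bytes as shift-jis yields two halfwidth characters.
garbled = raw.decode("shift-jis")
print([hex(ord(ch)) for ch in garbled])  # ['0xff83', '0xff62']

# Encoding back to shift-jis recovers the original bytes, so the round trip is lossless.
assert garbled.encode("shift-jis") == raw
assert garbled.encode("shift-jis").decode("utf-8") == original
```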

Trantamhvbc · 2023-10-13T02:30:51Z

I tried my code in Google Colab and got the same result as on my computer.

I also checked the pyzbar result in my program (b'\x8e\xa3'), and it is different from your result (b'\xc3\xa2'):

Eric-Canas · 2023-10-13T09:46:05Z

Hi!

Sorry for the inconvenience, I oversimplified the error. I have been researching it thanks to your Google Colab, and I found that the problem is that Windows and Linux do not use the same decoding. While the default "utf-8" pyzbar decoding was 'ﾃ｢' on Windows, it was '璽' on Linux.

I ran an extensive comparison of shift-jis against other encodings, and "Big5" is the one that gave me correct decoding results for all characters on Linux systems, as shift-jis did on Windows systems (it gives the same decoding as shift-jis in all cases where shift-jis works, and correct results in the cases where shift-jis fails on Linux).
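
The difference between the two charsets on Linux can be checked directly in plain Python. This is only a sketch of the byte-level behaviour for the single character discussed in this thread; it does not call QReader or pyzbar, and the variable names are illustrative.

```python
garbled_linux = "璽"  # the text pyzbar returned on Linux

# Re-encoding with shift-jis gives bytes that are not valid UTF-8.
sjis_bytes = garbled_linux.encode("shift-jis")
print(sjis_bytes)  # b'\x8e\xa3'
try:
    sjis_bytes.decode("utf-8")
    sjis_ok = True
except UnicodeDecodeError:
    sjis_ok = False
print(sjis_ok)

# Re-encoding with big5 gives the UTF-8 bytes of the original character.
big5_bytes = garbled_linux.encode("big5")
recovered = big5_bytes.decode("utf-8")
print(big5_bytes, recovered)
```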

I have uploaded an update that selects one encoding or the other as the default, depending on your OS ("Big5" fails on a lot of characters on Windows :( ). I have tested it on your Google Colab, and it now produces the expected results.
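
The OS-dependent default described above could look roughly like the following. This is a sketch under assumptions: the function name default_reencode_charset is hypothetical, and using os.name to detect Windows is one possible check, not necessarily the one the library uses.

```python
import os


def default_reencode_charset() -> str:
    """Hypothetical helper: pick the re-encoding charset based on the OS."""
    # os.name is 'nt' on Windows; Linux and macOS report 'posix'.
    return "shift-jis" if os.name == "nt" else "big5"


print(default_reencode_charset())
```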

You can upgrade with pip install --upgrade qreader. The previous version should still work if you instantiate QReader as QReader(reencode_to="big5").

Thanks a lot for your warning!

Trantamhvbc · 2023-10-16T01:17:33Z

Hi @Eric-Canas, I have checked your solution, and it gives correct decoding results on my computer.
Thanks for your support.

Trantamhvbc · 2023-10-17T09:57:41Z

Hi @Eric-Canas ,

I have checked QReader(reencode_to="big5") with the character 'â', and it gave the correct result. But when I checked a larger data set with QReader(reencode_to="big5"), I got many errors of the same kind.
Here are my code and data:

import json

from qreader import QReader
from PIL import Image
import qrcode
import cv2

image_path = "my_image.png"

qreader = QReader(model_size='n', reencode_to='big5')
with open('uit_member.json', 'r') as json_file:
    data = json.load(json_file)
j = 0     # number of mismatching entries so far
len_ = 0  # number of entries processed so far

for i in data:
    len_ += 1
    name = i["full_name"]
    # Encode the name as a QR image, save it, and read it back
    img = qrcode.make(name)
    img.save(image_path)
    img = cv2.imread(image_path)
    result = qreader.detect_and_decode(image=img)
    if name != result[0]:
        j += 1
        # Report the running error rate together with the mismatching entry
        print(f"{j*100/len_}% data {name} result = {result[0]} ")

Eric-Canas · 2023-10-18T08:08:47Z

Hi!

Thanks for your test data. I'm still testing; it seems that some entries are quite difficult to decode. For the moment, I can tell you that most of your errors should disappear this way:

QReader(reencode_to=('big5', 'shift-jis', 'latin1'))

But not all of them.
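
Passing several charsets suggests a fallback strategy: try each one in order and keep the first that yields valid UTF-8. The sketch below illustrates that idea in plain Python; reencode_with_fallback is a hypothetical name, and QReader's internal logic may differ.

```python
from typing import Iterable, Optional


def reencode_with_fallback(text: str, charsets: Iterable[str]) -> Optional[str]:
    """Hypothetical helper: return the first successful re-encoding, else None."""
    for charset in charsets:
        try:
            return text.encode(charset).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # This charset cannot represent the text, or its bytes are not UTF-8.
            continue
    return None


# 'Ã¢' is how the UTF-8 bytes of 'â' look when read as latin1.
print(reencode_with_fallback("\u00c3\u00a2", ("ascii", "latin1")))
# Plain ASCII text survives any of these charsets unchanged.
print(reencode_with_fallback("Anh", ("big5", "shift-jis", "latin1")))
```

A caveat of this kind of fallback is that the first charset that happens to produce valid UTF-8 wins, even when it is not the right one for that entry.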

To easily replicate the error, there should be a way to decode
b'L\xef\xbe\x83\xef\xbd\xaa Anh S\xef\xbe\x86\xef\xbd\xa1n'
as
Lê Anh Sơn

But I can't find any charset that works. Those are the raw bytes pyzbar detects from the QR code that qrcode generates for this entry, and I can't find any single or double encoding that decodes them correctly.
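
For anyone who wants to test candidate charsets against these bytes, a small harness like the following may help. It is a sketch in plain Python only: try_charsets is a hypothetical helper, the charset list is just an example, and nothing here runs pyzbar or QReader, so results may not carry over to the library or to other operating systems.

```python
raw = b'L\xef\xbe\x83\xef\xbd\xaa Anh S\xef\xbe\x86\xef\xbd\xa1n'
expected = "Lê Anh Sơn"

# The bytes are valid UTF-8; decoding them gives the text as reported by pyzbar.
text = raw.decode("utf-8")


def try_charsets(text, charsets):
    """Hypothetical helper: map each charset to its re-encoded text, or None on failure."""
    results = {}
    for charset in charsets:
        try:
            results[charset] = text.encode(charset).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            results[charset] = None
    return results


# Print each candidate next to whether it matches the expected name.
for charset, candidate in try_charsets(text, ("big5", "shift-jis", "latin1")).items():
    print(charset, repr(candidate), candidate == expected)
```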

Sorry, I'll update you if I find an alternative.

tranvannhat · 2023-10-24T08:19:18Z

Hi, I have the same issue.
The actual phrase in my QR code is:
Vĩnh Phong, Vĩnh Bảo, Hải Phòng
When using the library, I get:
V藺nh Phong, V藺nh B廕υ, H廕ξ Ph簷ng

congdaoduy298 · 2024-01-30T02:52:11Z

Hi, has anyone solved this problem or found an approach to handle this case? Thank you!

quyet12308 · 2024-03-08T08:54:04Z

Hello, I have the same issue.
When I scan the QR code on my ID card,
the correct text should be: HUỲNH HIẾU THUẬN
but the text I received was: Hu廙軟h Hi廕簑 Thu廕要
I have read your documentation and edited the reencode_to parameter, but it doesn't seem to work for me.
My language is Vietnamese.
And this is my code:

import cv2
from qreader import QReader

def read_qr_code_2(image_path):
    # Create a QReader instance
    qreader = QReader(model_size = 's', min_confidence = 0.5, reencode_to = 'utf-8')

    # Get the image that contains the QR code
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)

    # Use the detect_and_decode function to get the decoded QR data
    decoded_text = qreader.detect_and_decode(image=image)
    print(decoded_text)

Trantamhvbc closed this as completed Oct 16, 2023

Trantamhvbc reopened this Oct 17, 2023
