Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core (LV::Video) Fix alpha blending of 32-bit videos (#230) #244

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

kaixiong
Copy link
Member

@kaixiong kaixiong commented Feb 2, 2023

This is a rewrite of the buggy 32-bit LV::Video alpha blending code to deal with arithmetic underflows/overflows and the use of an uninitialized register in the MMX implementation (#230).

Take note that GCC/Clang x86-64 (recent only?) produces SSE instructions instead of MMX. GCC 12.2 uses the XMM registers while Clang sticks to MM but throwing in the use of pshuflw.

Perhaps it's time to move on and use SSE2 (introduced in 2000 to Pentium 4s) to work with 2 pixels at once. Or maybe even 4 pixels at once. This will require larger memory alignments and complicate the code a bit more to work with non-divisible row widths.

Here is a link to the plain C and SIMD code in Godbolt.

@kaixiong kaixiong self-assigned this Feb 2, 2023
@kaixiong kaixiong added this to the 0.5.0_alpha1 milestone Feb 2, 2023
@kaixiong kaixiong marked this pull request as ready for review February 5, 2023 02:26
Copy link
Member

@hartwork hartwork left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaixiong I think despite the CI being green I found two bugs in the new version of blit_overlay_alphasrc. Happy to learn I overlooked or misunderstood something, let's see. Interesting stuff!

Comment on lines +81 to +87
uint16_t const c0 = static_cast<uint16_t> (src_pixel[0]) * src_alpha + static_cast<uint16_t> (dst_pixel[0]) * (255 - src_alpha);
uint16_t const c1 = static_cast<uint16_t> (src_pixel[1]) * src_alpha + static_cast<uint16_t> (dst_pixel[1]) * (255 - src_alpha);
uint16_t const c2 = static_cast<uint16_t> (src_pixel[2]) * src_alpha + static_cast<uint16_t> (dst_pixel[2]) * (255 - src_alpha);

destbuf[0] = (alpha * (srcbuf[0] - destbuf[0]) >> 8) + destbuf[0];
destbuf[1] = (alpha * (srcbuf[1] - destbuf[1]) >> 8) + destbuf[1];
destbuf[2] = (alpha * (srcbuf[2] - destbuf[2]) >> 8) + destbuf[2];
dst_pixel[0] = c0 >> 8;
dst_pixel[1] = c1 >> 8;
dst_pixel[2] = c2 >> 8;
Copy link
Member

@hartwork hartwork Feb 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kaixiong ,

the TL;DR version would be that I think that:

  • The >> 8 in dst_pixel[0] = c0 >> 8; and siblings is too much
  • The cast to 16 bits needs to happen elsewhere, i.e. not cast(a * b + cast(c * d)) but cast(a * b) + cast(c * d). EDIT: mis-read the code, nevermind

Let me go into more detail on former:

My impression is that this tries to implement approach "A over B" at https://en.wikipedia.org/wiki/Alpha_compositing#Description , formalized there as:

A over B

a_0 := a_a + a_b * (1.0 - a_a)
C_0 := (C_a * a_a + C_b * a_b * (1.0 - a_a)) / a_0

Whereas for us:

A := src
B := dst

With a_a and a_b ranging from 0 to 255 rather than 0.0 to 1.0 I end up with these formulas for us, note the output range annotation.

a_0 := a_a/255 + a_b/255 * (1.0 - a_a/255))  # range 0..1
    == a_a/255 + a_b/255 * (255/255 - a_a/255))
    == (a_a + a_b * (255/255 - a_a/255)))/ 255
    == (a_a + a_b * (255 - a_a) / 255) / 255  # range 0..1
    => (a_a + a_b * (255 - a_a) / 255)        # range 0..255

C_0 := (C_a * a_a/255 + C_b * a_b/255 * (1.0 - a_a/255)) / (a_0/255)  # range 0..1
    == (C_a * a_a/255 + C_b * a_b/255 * (1.0 - a_a/255)) / a_0 * 255
    == (C_a * a_a/255 + C_b * a_b/255 * (255/255 - a_a/255)) / a_0 * 255
    == (C_a * a_a/255 + C_b * a_b/255 * ((255 - a_a) / 255)) / a_0 * 255  # range 0..1
    => (C_a * a_a     + C_b * a_b     * ((255 - a_a) / 255)) / a_0 * 255  # range 0..255

Now the current code in the pull request seems to assume that the dst image has an alpha channel with full 255 opacity a_b for all pixels. I'd like to understand why and would like to suggest addition of a comment to the code, but I'll take it for granted below for a moment.

So Inserting a_b := 255 I get …

a_0 := (a_a + a_b * (255 - a_a) / 255)
    == (a_a + 255 * (255 - a_a) / 255)
    == (a_a + (255 - a_a))
    == 255

C_0 := (C_a * a_a + C_b * a_b * ((255 - a_a) / 255)) / a_0 * 255
    == (C_a * a_a + C_b * 255 * ((255 - a_a) / 255)) / 255 * 255
    == (C_a * a_a + C_b *       ((255 - a_a)      ))
    == (src_pixel[0] * src_alpha + dst_pixel[0] * ((255 - src_alpha)))

Due to a_0 == 255 then dst_pixel[3] does not need to be written, confirmed.
However, C_0 is already in range 0..255 so the additional division in dst_pixel[0] = c0 >> 8; seems to be to much and should yield in an "almost black" picture, if I am not mistaken.

I'm happy to learn what I missed or to take this to a call, e.g. if my side here was hard to understand.

Best, Sebastian

Copy link
Member Author

@kaixiong kaixiong Feb 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hartwork, just to be clear this is only meant to be a fix of the original code. The operation here doesn't actually implement 'over' (or 'atop'), it's a simple linear interpolation between the source and target colours, using the source alpha as parameter. The target alpha is unchanged.

At some point, I want to have alpha compositing with the complete set of operators but that'll have to come later. There are additional API design considerations in there due to the necessity of premultiplied alpha, which the Wikipedia article mentions.

Regarding the actual calculation itself, the right-shift by 8 bits is necessary. Alpha is a percentage i.e. value in [0.0, 1.0] that's been mapped to [0, 255]. So what we're calculating here is really:

c = c1 * alpha/255 + c2 * (1 - alpha/255)

Multiplying throughout by 255 to work with only integers, we have:

c * 255 = c1 * alpha + c2 * (255 - alpha)

To get back c, we divide by 255 in the final step. Here >> 8 was originally chosen for performance reasons (as the test mentions), so I kept it as it is.

Since c1, c2 and alpha each have a maximum value of 255 and their products are involved, we need to work in at least uint16_t.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaixiong thanks for elaborating, let me digest that more.

@kaixiong
Copy link
Member Author

kaixiong commented Feb 8, 2024

@hartwork, any chance you could look at this again?

@hartwork
Copy link
Member

@hartwork, any chance you could look at this again?

@kaixiong I hope to find time to, in the coming days

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants