Short NUL-only keys incorrectly detected as duplicates in CBOR and SMILE #312

DaveCTurner · 2022-02-18T09:39:33Z

The following object apparently cannot be deserialised in CBOR and SMILE:

{"\u0000":"","\u0000\u0000":""}

The following are base64-encoded copies of its representation in various formats:

JSON: eyJcdTAwMDAiOiIiLCJcdTAwMDBcdTAwMDAiOiIifQ==
YAML: LS0tCiJcMCI6ICIiCiJcMFwwIjogIiIK
SMILE: OikKBfqAACCBAAAg+w==
CBOR: v2EAYGIAAGD/

As far as I can tell these all look correct, but on deserialization the field names are reported as duplicates:

com.fasterxml.jackson.core.JsonParseException: Duplicate field '�'
 at [Source: (byte[])":)
���� ��� �"; line: -1, column: 11]

	at __randomizedtesting.SeedInfo.seed([29F6747C61080477:49491DCB06A6B558]:0)
	at com.fasterxml.jackson.core.json.JsonReadContext._checkDup(JsonReadContext.java:204)
	at com.fasterxml.jackson.core.json.JsonReadContext.setCurrentName(JsonReadContext.java:198)
	at com.fasterxml.jackson.dataformat.smile.SmileParser._handleFieldName(SmileParser.java:1346)
	at com.fasterxml.jackson.dataformat.smile.SmileParser.nextToken(SmileParser.java:374)
	at com.fasterxml.jackson.core.JsonGenerator._copyCurrentContents(JsonGenerator.java:1935)
	at com.fasterxml.jackson.core.JsonGenerator.copyCurrentStructure(JsonGenerator.java:1914)
	at org.elasticsearch.xcontent.json.JsonXContentGenerator.copyCurrentStructure(JsonXContentGenerator.java:396)

Relates elastic/elasticsearch#84146

The text was updated successfully, but these errors were encountered:

DaveCTurner · 2022-02-18T12:57:44Z

No idea how one might fix this but here's a test for it at least (against 45e8780):

diff --git a/smile/src/test/java/com/fasterxml/jackson/dataformat/smile/parse/BasicParserTest.java b/smile/src/test/java/com/fasterxml/jackson/dataformat/smile/parse/BasicParserTest.java
index 26bfd377..a31feba8 100644
--- a/smile/src/test/java/com/fasterxml/jackson/dataformat/smile/parse/BasicParserTest.java
+++ b/smile/src/test/java/com/fasterxml/jackson/dataformat/smile/parse/BasicParserTest.java
@@ -478,6 +478,30 @@ public class BasicParserTest extends BaseTestForSmile
         p.close();
     }

+    public void testNULInFieldNames() throws IOException
+    {
+        ByteArrayOutputStream bytes = new ByteArrayOutputStream(1000);
+        JsonGenerator jgen = new SmileFactory().createGenerator(bytes);
+        jgen.writeStartObject();
+        for (String fieldName = ""; fieldName.length() < 20; fieldName += "\0") {
+            jgen.writeStringField(fieldName, "");
+        }
+        jgen.writeEndObject();
+        jgen.close();
+
+        SmileParser p = _smileParser(bytes.toByteArray());
+
+        assertToken(JsonToken.START_OBJECT, p.nextToken());
+        for (String fieldName = ""; fieldName.length() < 20; fieldName += "\0") {
+            assertToken(JsonToken.FIELD_NAME, p.nextToken());
+            assertEquals(fieldName, p.getCurrentName());
+            assertToken(JsonToken.VALUE_STRING, p.nextToken());
+            assertEquals("", p.getValueAsString());
+        }
+        assertToken(JsonToken.END_OBJECT, p.nextToken());
+        assertNull(p.nextToken());
+        p.close();
+    }

     public void testBufferRelease() throws IOException
     {

cowtowncoder · 2022-02-19T02:21:19Z

@DaveCTurner first of all, thank you for reporting this and especially contributing a test to reproduce!

I have a suspicion that the issue is with zero-padding of symbol table used; I vaguely recall something similar with JSON (with escaping, null bytes are legal in JSON names too).
But since Smile and CBOR codecs are bit different (with more straight-forward name decoding due to lack of escaping) it is possible that handling has non working edge case.

But the gist of this is that symbol table lookups are keyed by "quads" (int containing up to 4 bytes from name; last quad possibly being incomplete). Special padding is used to avoid conflicts for case of "unused" bits at the end; this only occurs in case of raw 0 bytes.
I forget the exact details of how this was handled in JSON but given time should be able to refresh my memory.

Now: I hope to look into this in near future, but I do have bit of a backlog. I just wanted to add a note saying above. :)

cowtowncoder · 2022-02-26T01:24:06Z

Ok, yes, I can reproduce this. Somehow looks like two-null-byte case gets decoded as single null byte, if I got that right.

cowtowncoder · 2022-02-26T01:55:52Z

Interestingly enough, JSON parser from jackson-core does NOT suffer from this. Could be due to escaping required, maybe decoding goes through different route.

Actually, hmmh. Yes, it goes through different decoder but ends up with similar look up, with a call to _padLastQuad; to fix:

FasterXML/jackson-core#148

and looks like JSON backend has no mismatch. Need to see if something similar would help with CBOR, Smile.

cowtowncoder · 2022-02-26T05:59:19Z

Hmmmh. Ok... this is tricky. Code in Smile, CBOR codecs does not keep track of "unused" bytes in quads, unlike JSON codec. But we do need to distinguish between "unused" and 0 bytes since only latter is part of actual decoded value.
Value of 0xFF (or -1 as signed) byte works because it is not a legal UTF-8 byte; that's what JSON codec uses.

Two ways to go about it, really:

Fill last possibly incomplete quad with -1s when starting (so anything unshifted remains so)
Pad at the end

JSON codec uses (2) for what that is worth.

DaveCTurner · 2022-02-26T08:02:45Z

👍 that ties in with what I thought I was seeing yes. I'd completely forgotten that 0xFF wasn't a legal byte here, but yes that's just what we need. Option 1 seems simpler:

diff --git a/smile/src/main/java/com/fasterxml/jackson/dataformat/smile/SmileParser.java b/smile/src/main/java/com/fasterxml/jackson/dataformat/smile/SmileParser.java
index ea4fe91c..72560434 100644
--- a/smile/src/main/java/com/fasterxml/jackson/dataformat/smile/SmileParser.java
+++ b/smile/src/main/java/com/fasterxml/jackson/dataformat/smile/SmileParser.java
@@ -1825,7 +1825,7 @@ versionBits));
         if (len < 5) {
             int inPtr = _inputPtr;
             final byte[] inBuf = _inputBuffer;
-            int q = inBuf[inPtr] & 0xFF;
+            int q = inBuf[inPtr] & 0xFF | 0xFFFFFF00;
             if (len > 1) {
                 q = (q << 8) + (inBuf[++inPtr] & 0xFF);
                 if (len > 2) {
@@ -1849,7 +1849,7 @@ versionBits));
         q1 =  (q1 << 8) | (inBuf[inPtr++] & 0xFF);

         if (len < 9) {
-            int q2 = (inBuf[inPtr++] & 0xFF);
+            int q2 = (inBuf[inPtr++] & 0xFF) | 0xFFFFFF00;
             int left = len - 5;
             if (left > 0) {
                 q2 = (q2 << 8) + (inBuf[inPtr++] & 0xFF);
@@ -1871,7 +1871,7 @@ versionBits));
         q2 =  (q2 << 8) | (inBuf[inPtr++] & 0xFF);

         if (len < 13) {
-            int q3 = (inBuf[inPtr++] & 0xFF);
+            int q3 = (inBuf[inPtr++] & 0xFF) | 0xFFFFFF00;
             int left = len - 9;
             if (left > 0) {
                 q3 = (q3 << 8) + (inBuf[inPtr++] & 0xFF);
@@ -1920,7 +1920,7 @@ versionBits));
         } while ((len -= 4) > 3);
         // and then leftovers
         if (len > 0) {
-            int q = inBuf[inPtr] & 0xFF;
+            int q = inBuf[inPtr] & 0xFF | 0xFFFFFF00;
             if (len > 1) {
                 q = (q << 8) + (inBuf[++inPtr] & 0xFF);
                 if (len > 2) {

That fixes it for me, but I see other similar code that might also have the same problem.

cowtowncoder · 2022-02-26T19:03:18Z

@DaveCTurner ah. I was looking at CBOR end of things. There are indeed a few other code paths to consider... partly for different lengths, partly for boundary conditions. So I'll need to improve the tests I think.

I'll see if I can fix this but just for sake of safety, for 2.14 branch. I have a feeling that there's non-trivial chance for regression for either correctness or performance here.

cowtowncoder · 2022-02-26T19:43:47Z

Ok, fixed; yes, pre-padding is probably bit simpler although both work (and had to use post-padding for one case in Smile).
Curious as to whether there's any measurable effect on decoding speed; probably worth checking 2.13 vs 2.14 decoding speed with jackson-benchmarks (https://github.com/FasterXML/jackson-benchmarks/)

DaveCTurner mentioned this issue Feb 18, 2022

Malformed mapping can make index snapshots not restorable or mountable (searchable snapshots) elastic/elasticsearch#84146

Closed

cowtowncoder added 2.13 cbor smile has-failing-test Indicates that there exists a test case (under `failing/`) to reproduce the issue labels Feb 19, 2022

cowtowncoder added a commit that referenced this issue Feb 26, 2022

Add failing tests for #312 (cbor, smile)

7f203f1

cowtowncoder added a commit that referenced this issue Feb 26, 2022

Minor cleanup wrt test for #312

2f293e4

cowtowncoder added a commit that referenced this issue Feb 26, 2022

Beef up tests wrt #312

b50ac1f

cowtowncoder added 2.14 and removed 2.13 labels Feb 26, 2022

cowtowncoder added this to the 2.14.0 milestone Feb 26, 2022

cowtowncoder closed this as completed in b082f58 Feb 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Short NUL-only keys incorrectly detected as duplicates in CBOR and SMILE #312

Short NUL-only keys incorrectly detected as duplicates in CBOR and SMILE #312

DaveCTurner commented Feb 18, 2022

DaveCTurner commented Feb 18, 2022

cowtowncoder commented Feb 19, 2022

cowtowncoder commented Feb 26, 2022

cowtowncoder commented Feb 26, 2022 •

edited

Loading

cowtowncoder commented Feb 26, 2022

DaveCTurner commented Feb 26, 2022

cowtowncoder commented Feb 26, 2022

cowtowncoder commented Feb 26, 2022

Short NUL-only keys incorrectly detected as duplicates in CBOR and SMILE #312

Short NUL-only keys incorrectly detected as duplicates in CBOR and SMILE #312

Comments

DaveCTurner commented Feb 18, 2022

DaveCTurner commented Feb 18, 2022

cowtowncoder commented Feb 19, 2022

cowtowncoder commented Feb 26, 2022

cowtowncoder commented Feb 26, 2022 • edited Loading

cowtowncoder commented Feb 26, 2022

DaveCTurner commented Feb 26, 2022

cowtowncoder commented Feb 26, 2022

cowtowncoder commented Feb 26, 2022

cowtowncoder commented Feb 26, 2022 •

edited

Loading