Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing compose IDs is (practically) impossible if - is allowed in product short name #77

Open
AdamWill opened this issue Feb 5, 2017 · 8 comments

Comments

@AdamWill
Copy link
Contributor

AdamWill commented Feb 5, 2017

I'm not sure if this is an issue or a cry for help...:)

Consider these compose IDs (one a real Fedora compose ID, one a made-up layered product compose ID from the test suite, since I don't know where to find a real RHEL layered product compose ID):

Fedora-Atomic-25-20170205.0 (product short name: Fedora-Atomic)
F-22-updates-BASE-3-updates-20160622.n.0 (product short name: F)

Now I read the code, all kinds of other subtleties are possible, which makes parsing compose IDs just super fun. But you could, I think, cope with many if not all of them by counting - characters. Maybe. I don't know. Life is awful.

However, as long as - characters are allowed in the product short name, all bets are off, because it just becomes entirely impossible to figure out which -s are being used as some kind of field separator by productmd, and which are just part of the product name (as in the first example).

I've been doing this to parse compose IDs in Fedoraland:

(name, version, _) = compose_id.rsplit('-', 2)
(date, typ, respin) = productmd.composeinfo.get_date_type_respin(compose_id)

The rsplit trick is to handle there being dashes in the shortname, and it relies on there being a known number of dashes after the shortname. But now I look at the test cases and code upstream, it's clear this is rather specific to my use cases and is relying on the fact that we don't have layered products (or at least, if we do, I don't care about any of 'em), or any type_suffixes (at least ones with dashes in them? I don't really know what these are), and I certainly don't care about RHEL 5 composes. But clearly someone does care about all those cases, so putting a generic compose ID parser in productmd (which is what I was trying to do) appears to be really rather futile, or at the least, would be very complex and have quite a bit of magic knowledge in it.

So...I don't know what to do. Would it be worthwhile making it a rule that product short names can't have dashes in them, and changing Fedora's configurations appropriately (to use FedoraAtomic, FedoraCloud and FedoraDocker as the short names, or something)? Or should I just give up on generically parsing compose IDs and rejig everything I have that does it, somehow (probably by requiring a round trip to the compose location or to PDC to fetch the full compose metadata, unfortunately)? Currently we have various things that get useful information just by parsing the compose ID because that's the only thing they get from a fedmsg they parse, and it's nice to be able to avoid a round trip. Doing this kind of rejigging for fedfind would also be rather tricky architecturally, but that's more or less my problem and not yours...

@lubomir
Copy link
Contributor

lubomir commented Feb 6, 2017

If I remember correctly, the assumption was that compose IDs should be opaque data and were not meant to be parsed. I can however see that there clearly are use cases where it's a relatively nice way to get some data.

Fedora indeed does not have layered products. (A layered product generally is a non-operating system thing; the base product then is the operating system it's meant to run on.)

Re type_suffix: this is based on compose type (production, test, nightly etc.). There can never be a - character in it. For ga it's empty, in all other cases it's a . followed by one or two characters.

Adding a rule to forbid dashes in short names would reduce the pain, and I think it can be done for new products. I'm not sure though if enforcing it at productmd level is the best idea (as opposed to doing it in tooling like Pungi to not allow creating new stuff that would be hard to work with).

@dmach might have some more insight into this.

@AdamWill
Copy link
Contributor Author

AdamWill commented Feb 7, 2017

"Re type_suffix: this is based on compose type (production, test, nightly etc.). There can never be a - character in it. For ga it's empty, in all other cases it's a . followed by one or two characters."

Sorry, I was talking about the release.type_suffix, not the compose.type_suffix. See https://github.com/release-engineering/productmd/blob/master/productmd/composeinfo.py#L140 . There can also apparently be a layered product type_suffix (see a few lines down)...I know what a compose type is, I'm not clear what the type suffix for a release or a layered product would be.

@lubomir
Copy link
Contributor

lubomir commented Feb 10, 2017

The release type is just a value that should describe the life cycle of the release. The possible values are listed at https://github.com/release-engineering/productmd/blob/master/productmd/common.py#L105-L112

If the value is ga, the type suffix is empty, otherwise it will use the exact value from the type.

As far as I know, Fedora is almost exclusively using ga. There are some updates releases in PDC, but I don't know if composes are being done for any of them.

For layered products this is a bit uglier: both the layered product and the base product can have types and they can be different (for example ga layered product on a eus base product).

@AdamWill
Copy link
Contributor Author

Is there any rule about having or not having a - in one of those values? :)

@AdamWill
Copy link
Contributor Author

OK, it looks like neither release.type_suffix nor base_product.type_suffix can have a - in it. That's good news. Aside from the 'dashes in shortnames' problem, I think that means we can actually write a reliable parser. However, the shortname rules still make it really hard:

#: Validation regex for release short name: [a-z] followed by [a-z0-9] separated with dashes.
RELEASE_SHORT_RE = re.compile(r"^[a-z]+([a-z0-9]*-?[a-z0-9]+)*$")

since the shortname can have both dashes and digits in it, we really can't reliably tell where the shortname ends and the release begins.

My best plan so far is to do the parsing in reverse (just like the existing get_date_type_respin does), but that's complicated by the fact that release versions can also have dashes in them! Oh joy:

#: Validation regex for release version: any string or [0-9] separated with dots.
RELEASE_VERSION_RE = re.compile(r"^([^0-9].*|([0-9]+(\.?[0-9]+)*))$")

so again we're stymied there, at least in theory.

Do we actually have any release versions with dashes in them, though? Fedora only has 'Rawhide' and integers. If we don't have any existing products using a dash in their release version, I think we can make it possible to reliably parse the compose ID if we just change that regex and make it a rule that versions can't have a dash in them. That'd be a less disruptive change than disallowing dashes in shortnames, since Fedora is already using shortnames with dashes in.

As a sidenote, btw, I'm not sure if that RELEASE_VERSION_RE regex actually does what it intends? I'm not clear what 'any string' means, but just about the only thing that regex doesn't allow is something that starts with a digit and then has anything but other digits and numbers:

>>> RELEASE_VERSION_RE.match('foobarmoo#%0-(#^(*()#^#')
<_sre.SRE_Match object at 0x7f8bff6a4b78>
>>> RELEASE_VERSION_RE.match('lalsaaksgosjagiopajhg-903ui690326iu90234udgajiopgadjnmkgald,.asghnajhga')
<_sre.SRE_Match object at 0x7f8bff6a4ae0>
>>> RELEASE_VERSION_RE.match('1a')
>>> 

@lubomir
Copy link
Contributor

lubomir commented Feb 13, 2017

I don't think there are any versions with dashes in them either.

Originally the version was required to be numeric (with components separated by .), and with the requirement to support Rawhide (#11) the other piece was added so that an arbitrary string works as well. I think tightening it up to at least prohibit dashes would not break anything, as Rawhide is the only release using non numeric version.

I think the regex works as intended, but that intention could be considered flawed: the version can be pretty much anything, but as long as it starts with a digit, it must follow the X.Y.Z format (for any number of components). This way looked saner that whitelisting Rawhide as the only exception.

@AdamWill
Copy link
Contributor Author

okay. so i think i'm gonna find a bit of time to see if I actually can write a reliable parser if we pretend that versions aren't allowed to have dashes in them, and if that's the case, it might be worth enforcing that rule. good to know the exception was only added for Rawhide, I wasn't sure about that.

@AdamWill
Copy link
Contributor Author

OK, so the best I could do is to write a parser which I think works if we make one change: make it so that versions can only be a 'normal' numerical-ish version (1, 26, 1.3.4.65.37, you know the drill) or 'Rawhide'. Or, at least, a specific list of strings, rather than allowing 'any string that doesn't start with a digit'.

I don't think I can do any better than that - I don't think just enforcing 'no dashes in versions' is really enough, now I sat down and wrote it.

The parser I wrote correctly parses all the compose IDs in fedfind's test suite, and also all the compose IDs in productmd's TestCreateComposeID test. It's currently a fedfind PR, because I was using it to test Pagure CI:

https://pagure.io/fedora-qa/fedfind/pull-request/7

I can adapt it as a productmd PR tomorrow, though, if you think it's worthwhile.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants