RevisionHistory.txt

Validate Fasta File Change Log

Version 2.3.7994; November 20, 2021
	- Fix logic bug for rules that have false for MatchIndicatesProblem
	- Warn if the protein sequence is likely DNA bases
	- Report an error if a line starts with a non-breaking space or tab
	- Add ability to read .fasta.gz files
	- Look for the Unicode replacement character after protein names

Version 2.2.7887; August 5, 2021
	- When /O is used to specify an output directory, the fixed FASTA file will now be created in the output directory instead of the input directory

Version 2.2.7794; May 4, 2021
	- Convert to C#
	- Add .faa as a supported file extension
	- Report a warning if there is a non-breaking space after the protein name
	- Add test file with UTF-8 encoding

Version 2.2.7410; April 15, 2020
	- Check for long lines (with millions of residues or invalid characters)
	- Update to .NET 4.7.2 and update PRISM.dll from NuGet

Version 2.2.7006; March 8, 2019
	- Use ComputeStringHashSha1 in PRISM.dll

Version 2.2.6622; February 17, 2018
	- Remove interfaces and use FileProcessor classes in PRISM.dll

Version 2.2.6471; September 19, 2017
	- Update to .NET 4.6.2
	- Use methods in PRISM.dll
	- Convert from Hashtable to Dictionary
	- Convert from ArrayList to generic lists

Version 2.1.6060; August 4, 2016
	- Increase default maximum protein name length from 34 to 60 characters

Version 2.1.5926; March 23, 2016
	- Fix bug determining the spanner key length for protein names
	- Add warnings for protein residues B, J, O, U, X, or Z (previously treated J as an error and X as a warning)

Version 2.1.5885; February 11, 2016
	- Updated behavior of /HashFile to store protein names in memory; not sequence hash values
		- Results in decreased memory usage
	- Updated clsNestedStringDictionary and clsNestedStringIntList to use StringComparer.Ordinal
	- Now removing leading and trailing whitespace from residues
	- Now auto-removing any asterisks at the end of protein sequences

Version 2.1.5884; February 10, 2016
	- Added switch /HashFile for loading a file previously created using /B
		- When /HashFile is used, the hash values are stored in a list rather than in a dictionary, reducing memory usage

Version 2.1.5877; February 3, 2016
	- Replaced dictionaries with clsNestedStringDictionary to allow for decreased memory usage
	- Replaced struct with a class for better memory allocation
	- Now reporting memory usage at regular intervals
	- Added clsStackTraceFormatter

Version 2.1.5773; October 22, 2015
	- Added option AllowAllSymbolsInProteinNames

Version 2.1.5611; May 13, 2015
	- Now allowing pound signs (#) in protein names

Version 2.1.5605; May 7, 2015
	- Implemented switches /AllowDash and /AllowAsterisk  (previously recognized, but had no effect)
	- Added switches /SkipDupeSeqCheck/ and SkipDupeNameCheck
	- Fixed bug in modMain that would set the rules using SetOptionSwitch but would not re-define the rules by calling SetDefaultRules

Version 2.1.5416; October 30, 2014
	- Now adding a summary line to the stats listing number of proteins, number of residues, and file size in KB

Version 2.1.5371; September 15, 2014
	- Now allowing for files to be opened even if another application has them open with a read/write lock

Version 2.1.5350; August 25, 2014
	- Now limiting the protein description to 7995 characters in length when consolidating duplicate proteins
	- Made ComputeProteinHash public

Version 2.1.5053; November 1, 2013
	- Added clsProcessFilesOrFoldersBase

Version 2.1.4808; March 27, 2013
	- Switched to AnyCPU
	- Removed debug statement

Version 2.1.4808; March 1, 2013
	- Updated ValidateFastaFile.dll to .NET 4
	- New version of clsParseCommandLine.vb and clsProcessFilesBaseClass.vb

Version 2.1.4646; September 20, 2012
	- Added option /KeepSameName
	- Added option /V to indicate that invalid residues (non-letters) should be removed
	- Updated to .NET 4
		- Replaced several hashtable objects with generic lists

Version 2.0.4486; April 13, 2012
	- Now showing error messages at the console instead of as popup message boxes

Version 2.0.4472; March 30, 2012
	- Allow equals and plus signs in protein names

Version 2.0.4415; February 2, 2012
	- Changed DEFAULT_MAXIMUM_PROTEIN_NAME_LENGTH to a public constant

Version 2.0.4276; September 16, 2011
	- Updated to Visual Studio 2010
	- Replaced .Now with .UtcNow

Version 2.0.3863; July 30, 2010
	- Updated to Visual Studio 2008 (.NET 2.0)
	- Now auto-enabling mCheckForDuplicateProteinNames if generating a fixed fasta and RenameProteinsWithDuplicateNames = True

Version 2.0.3637; December 16, 2009
	- Fixed bug that failed to output the column number to the FastaFileStats text file

Version 2.0.3597; November 6, 2009
	- Added option AllowDashInResidues

Version 2.0.3131; July 28, 2008
	- Added option SaveBasicProteinHashInfoFile
		- Useful for parsing huge .Fasta files to look for duplicate protein sequences without storing the protein names and sequences in memory

Version 2.0.3044; May 2, 2008
	- Added a new section to the XML parameter file: ValidateFastaFixedFASTAFileOptions
		- Moved parameters GenerateFixedFASTAFile and SplitOutMultipleRefsinProteinName into this new section
		- Renamed parameters FixedFastaRenameDuplicateNameProteins and FixedFastaConsolidateDuplicateProteinSeqs to RenameDuplicateNameProteins and ConsolidateDuplicateProteinSeqs
	- Added additional Fixed Fasta options
		- Ability to control whether or not long protein names are truncated (setting TruncateLongProteinNames)
		- Added option WrapLongResidueLines
			- When true, then wraps residue lines to length MaximumResiduesPerLine (default is 120 characters)
		- Added option RemoveInvalidResidues
			- When true, then characters that are not A-Z will be removed from the residues
			- When false, no residue characters are removed
		- You can now define additional special character lists in the XML parameter file:
			- LongProteinNameSplitChars
			- ProteinNameInvalidCharsToRemove
			- ProteinNameFirstRefSepChars
			- ProteinNameSubsequentRefSepChars
		- Added new option for splitting out multiple references from protein names, whereby this process is only performed if the name matches a known pattern
			- Known patterns are IPI, gi, and jgi
			- Examples:
				- IPI:IPI00048500.11|ref|23848934  is split to IPI:IPI00048500.11 ref|23848934
				- gi|169602219|ref|XP_001794531.1| is split to gi|169602219 ref|XP_001794531.1|
				- jgi|Batde5|90624|GP3.061830      is split to jgi|Batde590624 GP3.061830

Version 2.0.2900; December 10, 2007
	- When generating a fixed fasta file, the program can now ignore I and L residue differences when consolidating proteins that have duplicate sequences
		- Use /L in the command line

Version 2.0.2896; December 6, 2007
	- When generating a fixed fasta file, the program can now optionally consolidate proteins that have duplicate sequences
		- Use /D in the command line

Version 2.0.2771; August 3, 2007
	- When generating a fixed fasta file, the program will also now create a text file that lists the hash value for each unique sequence found in the input file
		- Also included is the first protein name with for each sequence, and details on the other proteins that have the same exact sequence 
		- Additionally, if duplicate proteins are found, then a mapping file will be created that lists the first protein name, the sequence length, and each duplicate protein name
	- Removed the dependence on PRISM.Dll and SharedVBNetRoutines.dll by adding files clsParseCommandLine.vb and clsXmlSettingsFileAccessor.vb to the project

Version 2.0.2769; August 1, 2007
	- Added option to rename proteins with duplicate names rather than skipping them when generating a fixed fasta file
		- When renaming, first tries appending "-b" to the original protein name, then checks if this new name matches any other protein names
		- If the new name still results in a duplicate protein, then "-c", "-d", "-e", etc. are tried
	- Fixed bug that incorrectly allowed residue lines to be written to a fixed fasta file when an invalid protein was encountered (having a duplicate name or a space after the > symbol)

Version 2.0.2684; May 8, 2007
	- Added new warning that checks for duplicate protein sequences
	- Now clearing the protein name and description each time a line that starts with > is encountered

Version 2.0.2347; June 5, 2006
	- Added new warning that checks for protein descriptions over 900 characters long

Version 2.0.2203; January 12, 2006
	- No longer trimming spaces from the end of a line since we need to be able to check for residue lines ending in spaces
	- Updated the logging (file or console) to include the value and context associated with a given message, warning, or error

Version 2.0.2103; October 4, 2005
	- Updated the SaveSettingsToParameterFile function to re-open the .Xml file and replace instances of "&gt;" with ">" to improve readability

Version 2.0.2056; August 18, 2005
	- Added new validation rule to look for escape code characters in the protein description
	- Added option to fix the line terminator to guarantee it ends in CRLF

Version 2.0.2028; July 22, 2005
	- Stable release