Custom space symbol #83

Open: wants to merge 5 commits into base: master

gwenniger

I added an extra argument to the decoders that lets callers specify a custom space symbol. Currently the space symbol used by the decoder is hard-coded as " ". That is probably fine in most cases, but it does not work for my problem domain, handwriting recognition, where the word separator can be a special symbol such as "|" and the ordinary space " " may not be used at all.
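To make the change concrete: the decoder has to locate the word separator in the vocabulary, and this PR turns the literal " " into a parameter. A minimal sketch of that lookup follows. The name `find_space_id` and the -1 "not found" sentinel are hypothetical and chosen for illustration; they are not the functions touched by this PR.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper (not part of ctcdecode): returns the index of
// `space_symbol` in `vocabulary`, or -1 if the symbol is absent.
// The default keeps the old behaviour of looking for " ".
int find_space_id(const std::vector<std::string>& vocabulary,
                  const std::string& space_symbol = " ") {
  auto it = std::find(vocabulary.begin(), vocabulary.end(), space_symbol);
  if (it == vocabulary.end()) {
    return -1;  // separator is not part of this vocabulary
  }
  return static_cast<int>(std::distance(vocabulary.begin(), it));
}
```

With a handwriting vocabulary such as {"_", "a", "b", "|"}, calling `find_space_id(vocab, "|")` finds the separator, whereas the default " " would not.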

specify the space symbol with. The default space symbol is
" ". This was previously hard-coded in the decoder, causing
problems when using it for handwriting recognition where the
space symbol may be different, for example "|".

	modified:   ctcdecode/__init__.py
	modified:   ctcdecode/src/binding.cpp
	modified:   ctcdecode/src/binding.h
	modified:   ctcdecode/src/ctc_beam_search_decoder.cpp
	modified:   ctcdecode/src/ctc_beam_search_decoder.h
	modified:   ctcdecode/src/decoder_utils.cpp
	modified:   ctcdecode/src/scorer.cpp
	modified:   ctcdecode/src/scorer.h
…be called on the space

symbol before passing it as an argument
to "ctc_decode.paddle_get_scorer", "ctc_decode.paddle_beam_decode" and
"ctc_decode.paddle_beam_decode_lm". Otherwise it will not be in the const char* format expected
by the C++ interface for these parameters.

	modified:   ctcdecode/__init__.py
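For context on the commit above: a C++ parameter declared as `const char*` is a pointer to NUL-terminated bytes, which is why the Python side hands over an encoded value (the later diff passes `self._space_symbol.encode()`). The sketch below shows one way the receiving side could turn that pointer into the `std::string` the decoder compares against. The function name and the null fallback are assumptions for illustration, not code from this PR.

```cpp
#include <string>

// Hypothetical conversion at a binding boundary (not ctcdecode's actual code).
// `space_symbol` is expected to point at NUL-terminated bytes; a null pointer
// falls back to the historical default of a single space.
std::string space_symbol_from_c(const char* space_symbol) {
  if (space_symbol == nullptr) {
    return std::string(" ");
  }
  // std::string stores raw bytes, so a multi-byte UTF-8 separator is kept as-is.
  return std::string(space_symbol);
}
```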
ryanleary (Collaborator) left a comment

Thanks for the contrib, Gideon. A few minor nits for you to correct and we'll get it merged in. Thanks!

@@ -35,7 +36,9 @@ std::vector<std::pair<double, Output>> ctc_beam_search_decoder(
   // size_t blank_id = vocabulary.size();

   // assign space id
-  auto it = std::find(vocabulary.begin(), vocabulary.end(), " ");
+  // Changed by Gideon from the blank symbol " " to a custom symbol specified as argument
Collaborator

Dead code tells no lies. Please remove this comment and the line of code you commented out below (line 41). That's what git history is for.

@@ -153,7 +153,8 @@ bool add_word_to_dictionary(
   std::vector<int> int_word;

   for (auto &c : characters) {
-    if (c == " ") {
+    // if (c == " ") {
+    if (c == "|") { // Gideon: replaced the space symbol " " => "|"
Collaborator

This can't be hardcoded. You'll have to parameterize the space character. Though, looking at it more closely, it looks like you could probably just do a lookup based on the SPACE_ID param...
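One reading of the suggestion above, as a sketch: if the separator is already present in the character map and maps to the space id, the loop needs no string literal at all. It can look every character up in the map and append the passed-in id where a trailing separator is wanted. `word_to_ids` is a hypothetical stand-in for the loop in `add_word_to_dictionary`, and the assumption that the map contains the separator is mine, not something this PR confirms.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch (not the PR's code): converts characters to ids without
// comparing against a hard-coded " " or "|".
// Assumption: `char_map` already maps the separator symbol to `space_id`.
// Returns false, leaving `ids` partially filled, if a character is unknown.
bool word_to_ids(const std::vector<std::string>& characters,
                 const std::unordered_map<std::string, int>& char_map,
                 int space_id,
                 bool add_space,
                 std::vector<int>* ids) {
  ids->clear();
  for (const auto& c : characters) {
    auto it = char_map.find(c);
    if (it == char_map.end()) {
      return false;  // unknown character: reject the word
    }
    ids->push_back(it->second);  // the separator resolves through the map too
  }
  if (add_space) {
    ids->push_back(space_id);  // trailing separator uses the parameter
  }
  return true;
}
```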

@@ -16,7 +16,8 @@ using namespace lm::ngram;
 Scorer::Scorer(double alpha,
                double beta,
                const std::string& lm_path,
-               const std::vector<std::string>& vocab_list) {
+               const std::vector<std::string>& vocab_list,
+               const std::string &space_symbol) {
Collaborator

Can you please set a default arg here.
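A sketch of what a default argument could look like here. In C++ the default value belongs on the declaration (normally in the header) and is not repeated on the out-of-class definition, so existing four-argument callers keep compiling. `ScorerSketch` is a hypothetical, trimmed-down class for illustration; it does not load a language model and is not the real `Scorer`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the Scorer class (illustration only).
class ScorerSketch {
 public:
  // Declaration: the default value is written here, once.
  ScorerSketch(double alpha,
               double beta,
               const std::string& lm_path,
               const std::vector<std::string>& vocab_list,
               const std::string& space_symbol = " ");

  int space_id() const { return space_id_; }

 private:
  double alpha_;
  double beta_;
  std::string lm_path_;
  std::string space_symbol_;
  int space_id_;  // -1 when the separator is not in the vocabulary
};

// Definition: same signature, without repeating the default.
ScorerSketch::ScorerSketch(double alpha,
                           double beta,
                           const std::string& lm_path,
                           const std::vector<std::string>& vocab_list,
                           const std::string& space_symbol)
    : alpha_(alpha),
      beta_(beta),
      lm_path_(lm_path),
      space_symbol_(space_symbol),
      space_id_(-1) {
  for (std::size_t i = 0; i < vocab_list.size(); ++i) {
    if (vocab_list[i] == space_symbol_) {
      space_id_ = static_cast<int>(i);
    }
  }
}
```

Callers that omit the last argument get the previous " " behaviour; handwriting users pass "|" explicitly.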

   if (labels.empty()) return {};

   std::string s = vec2str(labels);
   std::vector<std::string> words;
   if (is_character_based_) {
     words = split_utf8_str(s);
   } else {
-    words = split_str(s, " ");
+    // words = split_str(s, " ");
+    words = split_str(s, space_symbol); //Gideon: replaced the space character from " " to a custom string
Collaborator

Delete dead code and comment
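For readers unfamiliar with the hunk above: it switches word splitting from a literal " " to the configured separator. A minimal sketch of splitting on an arbitrary separator string follows. `split_on` is a hypothetical helper written for this example and may not match the behaviour of the project's `split_str` (for instance, this version drops empty tokens).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not the project's split_str): splits `s` on every
// occurrence of `delim` and drops empty tokens.
std::vector<std::string> split_on(const std::string& s,
                                  const std::string& delim) {
  std::vector<std::string> parts;
  if (delim.empty()) {  // nothing to split on: return the input as one token
    if (!s.empty()) parts.push_back(s);
    return parts;
  }
  std::size_t start = 0;
  while (true) {
    std::size_t end = s.find(delim, start);
    if (end == std::string::npos) {
      if (start < s.size()) parts.push_back(s.substr(start));
      break;
    }
    if (end > start) parts.push_back(s.substr(start, end - start));
    start = end + delim.size();  // continue after the separator
  }
  return parts;
}
```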

-    if (char_list_[i] == " ") {
-      SPACE_ID_ = i;
+    //if (char_list_[i] == " ") {
+    if (char_list_[i] == space_symbol) { //Gideon: replaced the space character from " " to a custom string
Collaborator

Delete dead code and comment

…h, which

facilitates usage with pytorch 1.0.

Merge branch 'master' of https://github.com/parlance/ctcdecode into custom_space_symbol

	ctcdecode/__init__.py
	ctcdecode/src/binding.cpp
	ctcdecode/src/binding.h
	ctcdecode/src/ctc_beam_search_decoder.cpp
	ctcdecode/src/ctc_beam_search_decoder.h

 Changes to be committed:
	modified:   README.md
	modified:   build.py
	modified:   ctcdecode/__init__.py
	modified:   ctcdecode/src/binding.cpp
	modified:   ctcdecode/src/binding.h
	modified:   ctcdecode/src/ctc_beam_search_decoder.cpp
	modified:   ctcdecode/src/ctc_beam_search_decoder.h
	modified:   ctcdecode/src/decoder_utils.cpp
	modified:   ctcdecode/src/decoder_utils.h
	modified:   ctcdecode/src/path_trie.cpp
	modified:   ctcdecode/src/path_trie.h
	modified:   requirements.txt
	modified:   setup.py
	modified:   tests/test.py
"self._space_symbol.encode()" was swapped during merging of the code.

-                                          self._cutoff_prob, self.cutoff_top_n, self._blank_id,self._log_probs, self._space_symbol.encode(), output, timesteps,
-                                          scores, out_seq_len)
+                                          self._cutoff_prob, self.cutoff_top_n, self._blank_id,  self._space_symbol.encode(), self._log_probs,
+                                          output, timesteps, scores, out_seq_len)

	modified:   ctcdecode/__init__.py
2 participants