Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optionally apply merged cells across their range #40

Open
nacnudus opened this issue Jul 31, 2018 · 21 comments
Open

Optionally apply merged cells across their range #40

nacnudus opened this issue Jul 31, 2018 · 21 comments

Comments

@nacnudus
Copy link
Owner

E.g. if a cell with value "foo" is merged across A1:A2 then cell A2 should be created/overridden as a clone.

@dmcalli2
Copy link

On this topic, I was wondering what the default behavior was with tidyxl when it encounters merged cells.
If, for example, cells A1:B2 is a single merged cell containing the value "some_value", does this translate to the following?

row col value
1 1 "some_value"
1 2 NA
2 1 NA
2 2 NA

If so, I think the enhead function in https://github.com/nacnudus/unpivotr will handle this problem fairly easily.

@nacnudus
Copy link
Owner Author

That's correct, and is the reason why I haven't thought of a need for this yet. It might still be useful to know when a cell is part of a merge.

@ianmoran11
Copy link

ianmoran11 commented Mar 30, 2019

This would be super useful for tidyABS. At the moment, I'm inferring whether cells are merged based on either style or local_format_id (see here). Whether style or local_format_id work depends on the spreadsheet. A more reliable approach to unmerging would be great.

@nacnudus
Copy link
Owner Author

@ianmoran11 thanks for commenting here. I've skimmed ianmoran11/tidyABS@d34fd3f to try to understand the use of unmerging cells, but I don't understand yet. Is it part of guessing which cells are headers? If so, would it be enough to know that a cell is merged? That would be much easier to implement than the next step, which is to clone that cell across the merged range.

@ianmoran11
Copy link

Sorry, I missed your reply. Unmerging column headers, in ABS tables, increases the chance that N gives the correct relationship between the table data and the header (and this seems to be true of many other tables too). Without unmerging the column header, tidyABS would need to guess between NNW or N for each column header group, which turns out to be difficult.

Knowing whether the cell is merged would allow me to know with more certainty whether I should replace a cell's value with the value of the cell to the left.

I'm hope that explanation is clear - happy to provide a reprex to clarify.

@nacnudus
Copy link
Owner Author

nacnudus commented Apr 7, 2019

I see. Could tidyABS use NNW every time? N is a convenient way to filter out cells that are outside the table (they don't have any headers), but otherwise it makes no difference. If tidyABS has another way to drop cells outside the table, then I would recommend this approach because it generalises to header cells that haven't been merged.

@ianmoran11
Copy link

Using NNW every time was my initial approach, but it runs into trouble with some tables. See, for example, the table below.

By the way, I don't have a good understanding of how much work would be involved in adding this feature, and tidyABS gets by most of the time with its current approach to unmerging and filling cells, so this is far from an urgent request - just a nice-to-have.

image

@nacnudus
Copy link
Owner Author

nacnudus commented Apr 8, 2019

Excellent example. I'm persuaded.

I had a go at this last night and it's just fiddly, not actually difficult.

@bedantaguru
Copy link

Alternatively, two additional columns can be added
is_merged_cell, merged_cell_id

is_merged_cell should be TRUE for merged cells

The merged_cell_id can be given sequentially (as per reading sequence maintained in xlsx_cells) over entire xlsx file. For each cell in a single merged cell (which is a set of cells) should receive the same id and in case of non merged cell that can be simply NA.

@dmcalli2
Copy link

Just wanted to say that this would be helpful for my usage too. @rpsoft I am using tidyxl to re-arrange tables from published papers into a tidy format.
The suggestion above is great,

Alternatively, two additional columns can be added
is_merged_cell, merged_cell_id

is_merged_cell should be TRUE for merged cells

The merged_cell_id can be given sequentially (as per reading sequence maintained in xlsx_cells) over entire xlsx file. For each cell in a single merged cell (which is a set of cells) should receive the same id and in case of non merged cell that can be simply NA.

However, I wonder if merged_cell_id should not be NA, but rather should be the address for the top left cell, and in the case of single cells, this is of course just the cell reference.
This would fit in well wiht the tidyverse, as it would make it easier to do grouped operations on merged cells, eg

## Repeat the character string across the range of merged cells
mydf %>%
  arrange(row, cell) %>%
  group_by(merged_cell_id) %>%
  mutate(character = character[1]) %>%
  ungroup()

@nacnudus
Copy link
Owner Author

Thanks @bedantaguru and @dmcalli2 for the suggestions.

What do you think about:

  • not having an is_merged column, and
  • having both merged_cell_sheet and merged_cell_address columns to store the full address of the top-left cell in the merge?

@dmcalli2
Copy link

Sounds great. Do you think that it would be good to set the value of merged_cell_address to the cell value when it is not a merged cell so it can be grouped-on in a single step? Or is that counter-intuitive for users?
I suppose it would only require an extra step if merged_cell_address is set to NA for non-merged cells.

mydf %>%
  arrange(row, cell) %>%
  mutate(merge_on= if_else(is.na(merged_cell_address), cell_address, merged_cell_address) %>% 
  group_by(merge_on) %>%
  mutate(character = character[1]) %>%
  ungroup()

@nacnudus
Copy link
Owner Author

Oh yes, I meant to ask what value you would expect merged cells to have (other than the top-left). I had assumed they ought to be pre-populated with the value of the top-left cell, in which case your example wouldn't be necessary.

@dmcalli2
Copy link

I think you are right.

@bedantaguru
Copy link

Thanks @bedantaguru and @dmcalli2 for the suggestions.

What do you think about:

  • not having an is_merged column, and
  • having both merged_cell_sheet and merged_cell_address columns to store the full address of the top-left cell in the merge?

@nacnudus agree to not have is_merged but in that case, why do we need two columns.
I'm sorry but I did not get why we need merged_cell_sheet (sheet column is already there)(can a merged cell spread across sheets ? if so then fine.)

I guess as you suggested merged_cell_address makes sense. (then you don't need to do it optionally; in xlsx_cells you can have it by default)

In fact, your initial suggestion is also perfect (to optionally populate same fields for all cells in the merge). But in this case, you need an optional argument in xlsx_cells.

If the operation is not costly then you may choose the first one otherwise go for the second.

Either way, it looks perfectly fine.

@bedantaguru
Copy link

I can see the xlsx contains information like these :
<mergeCells count="1"><mergeCell ref="A1:B1"/></mergeCells>
for merged cells.

@dmcalli2
Copy link

Is the desired behaviour agreed now (see below) and if so can we (myself and @rpsoft) help with the implementation, as this feature would be very useful for our current project?

Desired behaviour

  • Repeat top left value across all merged cells. Do so as the default with the option to suppress this behaviour xlsx_cells(repeat_across_merged = TRUE) .

@nacnudus
Copy link
Owner Author

@bedantaguru you are right, merged_cell_sheet is not necessary.

@dmcalli2 and @rpsoft thank you for offering to help with the implementation, which will be as follows (suggestions still accepted though!):

  1. A new column merged_cell_address in the data frame returned by xlsx_cells(). It will be NA for cells that are not merged, otherwise it will be the address of the top-left cell in the merge. The top-left cell will have its own address in the merged_cell_address column. This will need to be in the C++ code.
  2. Functions unmerge_cells(cells) and merge_cells(cells). They will need to cope when cells from different sheets are in the same data frame, and they should do something sensible when the cells to be merged or unmerged don't exist (i.e. the rectangle isn't complete, especially if the top-left cell is missing). These should be fine in R code, the speed should be fine.

I haven't checked whether the formatting of a merged cell is the same as the top-left cell in the merge. Hopefully it is, otherwise unmerging will be a Grade A pain.

Sorry for being slow to develop this. I don't have much time to spend on tidyxl at the moment.

@dmcalli2
Copy link

Great. I'll fork the repo, talk with @rpsoft and make a plan.

@bedantaguru
Copy link

How are you all guys..

Have you given it a try...

@miljukov-o
Copy link

Hi, also desperately needing this feature...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants