The cesspool that is MS Word #2

Open
StevenBlack opened this issue Jan 10, 2018 · 8 comments

Comments

@StevenBlack
Member

StevenBlack commented Jan 10, 2018

Presently stymied by the utter BS that is MS Word.

This issue is a place to list content patterns, and devise tactics, to extract content from Word.

This issue is a spec that is a work in progress.

Feel free to edit and augment comments as opposed to adding to the thread. We can create separate issues for individual tactics, cross-referencing them back here.

@StevenBlack
Member Author

StevenBlack commented Jan 10, 2018

Tables – structures

I wonder whether these are all the same: each has a header row whose first column is identifiable, like ["Field", "Field Name", ...]. If so, one transform might be good for them all. Note that these tables can have a variable number of columns.

Examples from S5C3.

2018-01-10_11-06-23

2018-01-10_11-13-52

DH: I don't think they're all the same, unfortunately. Here's one from S5C2, which uses "Field" rather than "Field Name".

s5c2-5-2

Another

2018-01-10_14-08-04
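
A minimal sketch of the first-column test discussed above, assuming each table has already been extracted into a list of rows, where each row is a list of cell strings. That representation, the helper names, and the sample column names ("Type", "Width") are all hypothetical; only "Field" and "Field Name" come from the examples in this thread.

```python
# Header texts that mark a "structure" table, per the examples above.
# Extend this set as more variants turn up in other documents.
STRUCTURE_FIRST_HEADERS = {"field", "field name"}


def normalize(cell):
    """Collapse runs of whitespace and ignore case so 'Field  Name' matches."""
    return " ".join(cell.split()).casefold()


def is_structure_table(rows):
    """Return True if the first header cell looks like a structure-table header.

    `rows` is assumed to be a list of rows, each a list of cell strings.
    The number of columns is deliberately not checked, since it varies.
    """
    if not rows or not rows[0]:
        return False
    return normalize(rows[0][0]) in STRUCTURE_FIRST_HEADERS
```

Because only the first header cell is inspected, a table with three columns and one with six are treated the same way.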

@StevenBlack
Member Author

StevenBlack commented Jan 10, 2018

Tables – property lists

Example from S5C3.

2018-01-10_11-16-08

DH: I didn't check too many docs, but in S4, I think there's some consistency: this is S4G489:

s4g489

S4G548 has a table with the same headers.

@StevenBlack
Member Author

StevenBlack commented Jan 10, 2018

Tables – property values

Example from S5C3.

2018-01-10_11-18-42

DH: Probably no consistency here. This is from S4G387; note the property name (RowSourceType) in the column header:

s4g387

@StevenBlack
Member Author

StevenBlack commented Jan 10, 2018

Tables – callouts

Cool

Example from S5C3.

2018-01-10_11-29-17

Design

Example from S5C2.

2018-01-10_11-44-14

Bug

Example from S5C2.

2018-01-10_11-47-01

DH: I'm guessing these are consistent, since they're basically two-column tables with no header.
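
One possible way to test for "two columns, no header", offered only as a sketch. It assumes tables arrive as lists of rows of cell strings, and it treats a first row as a header if any of its cells matches a header word seen elsewhere in this thread. The `KNOWN_HEADER_WORDS` set and the function name are hypothetical and would need tuning against the real documents.

```python
# Header words seen in the other table categories in this thread.
# Assumption: a callout's first row never contains any of these on its own.
KNOWN_HEADER_WORDS = {
    "field", "field name", "value", "description", "purpose", "example",
}


def is_callout_table(rows):
    """Return True for a two-column table whose first row is not a known header."""
    if not rows or any(len(row) != 2 for row in rows):
        return False
    first_row = [cell.strip().casefold() for cell in rows[0]]
    return not any(cell in KNOWN_HEADER_WORDS for cell in first_row)
```

This check says nothing about which kind of callout (Cool, Design, Bug) a table is; that would need some other signal, such as the icon in the first cell.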

@StevenBlack
Member Author

StevenBlack commented Jan 10, 2018

Tables – code examples

Possibly easy to identify as a squat 2-column table with "Example" in column 1.

Example from S5C3.

2018-01-10_11-32-05

DH: I'm pretty sure this is consistent.
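
A rough sketch of that check, assuming tables are lists of rows of cell strings. "Squat" is interpreted here as at most `max_rows` rows; that threshold and the function name are assumptions, not something measured from the documents.

```python
def is_code_example_table(rows, max_rows=2):
    """Return True for a short 2-column table whose first cell reads 'Example'.

    `max_rows` is a guess at what "squat" means; adjust after checking
    real tables.
    """
    if not rows or len(rows) > max_rows:
        return False
    if any(len(row) != 2 for row in rows):
        return False
    return rows[0][0].strip().casefold() == "example"
```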

@StevenBlack
Member Author

StevenBlack commented Jan 10, 2018

Tables – mundane lists

These might be easy to identify as 2-column tables with headers of the form <something that is not ["Value", ...]> and ["Description", "Purpose"...].

Examples from S5C3.

2018-01-10_11-35-20

2018-01-10_11-41-41

Examples from S1C2.

2018-01-10_14-05-01
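
The header rule above could be sketched like this, again assuming tables are lists of rows of cell strings. The two sets hold only the header words named in this comment; the set names and the function name are hypothetical.

```python
# First-column headers that indicate a property-values table instead.
EXCLUDED_FIRST_HEADERS = {"value"}
# Second-column headers that mark a mundane list.
DESCRIPTION_HEADERS = {"description", "purpose"}


def is_mundane_list(rows):
    """Return True for a 2-column table headed <not Value> / <Description or Purpose>."""
    if not rows or len(rows[0]) != 2:
        return False
    first, second = (cell.strip().casefold() for cell in rows[0])
    return first not in EXCLUDED_FIRST_HEADERS and second in DESCRIPTION_HEADERS
```

Both sets are expected to grow as the "..." in the lists above gets filled in from other documents.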

@StevenBlack
Member Author

StevenBlack commented Jan 10, 2018

Tables – "colspan" headers

These will likely be problematic. Easy enough to identify, I reckon.

Examples from S1C2.

2018-01-10_14-01-25
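
A sketch of how identification might work, under one big assumption: that the extraction step emits a merged header cell once, so the header row ends up with fewer cells than the rows below it. If the extraction tool instead repeats the merged cell across its span, this test will not fire and a different signal is needed. The function name is hypothetical.

```python
def has_colspan_header(rows):
    """Return True if the first row has fewer cells than the widest later row.

    Assumes merged header cells appear once in the extracted row, which
    makes the header row narrower than the body rows.
    """
    if len(rows) < 2:
        return False
    body_width = max(len(row) for row in rows[1:])
    return len(rows[0]) < body_width
```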

@StevenBlack
Member Author

StevenBlack commented Jan 10, 2018

Tables – simple

And then we have simple, generic tables; these should be easy. Are these just another kind of mundane list?

Example from S1C5.

2018-01-10_14-11-20
