.doc(x) to .md file conversion
This is the source page doc2md.docx which the conversion process is supposed to convert into doc2md.md.
This page works through the requirements one at a time. Astro accepts both .md and .mdx content, so doc2md-md.md and doc2md-mdx.mdx are being developed in parallel. MD is preferred where possible, because MDX throws syntax errors on all sorts of innocuous-looking strings.
We need to be able to convert .doc and .docx files from my documentation store to .md(x) for Astro content creation.
We need a utility that does the conversion automatically, one file at a time. I don’t think we’ll ever need a bulk (multi-file) converter.
Various schemes have been tried:
- pandoc. Very popular. Can’t cope with my paragraph style. Might be worth experimenting with its Markdown format options.
- Mammoth. Doesn’t seem to be configurable. Can’t cope with my paragraph style.
This is a two-column list of external links, both to other documentation files in my store and to web URLs.
Markdown can’t cope with columns, but simple HTML works in Astro MD.
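As a rough illustration of the "simple HTML" route, here is a minimal Python sketch that renders two lists of links as side-by-side columns. The function name, the inline flex styling and the example URLs are all assumptions made for this sketch; the real page may well use classes from a stylesheet instead.

```python
from html import escape


def two_column_links(left, right):
    """Render two lists of (label, url) pairs as side-by-side HTML columns.

    Hypothetical helper: the inline flex styles are an assumption,
    not the markup actually used on this site.
    """
    def column(links):
        # One <li> per link, with label and URL escaped for HTML.
        items = "\n".join(
            f'    <li><a href="{escape(url, quote=True)}">{escape(label)}</a></li>'
            for label, url in links
        )
        return f'  <ul style="flex: 1">\n{items}\n  </ul>'

    return (
        '<div style="display: flex; gap: 2em">\n'
        f"{column(left)}\n{column(right)}\n"
        "</div>"
    )


if __name__ == "__main__":
    print(two_column_links(
        [("Local notes", "notes.docx")],          # illustrative entries only
        [("Example site", "https://example.com")],
    ))
```

The output is a single HTML block with no blank lines inside it, so a Markdown parser should pass it through unchanged.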
Contents
What has to work
1. My sort of paragraph. See Paragraphs
2. Internal anchors and links
3. Tables. See Tables
4. Images. See Images
5. Columns
Markdown tables have to have a header by default:
| Feature | Status | Notes |
|---|---|---|
| Astro Layouts | Working | Using @layouts alias |
| Styles | Scoped | Testing specificity |
| Indentation | 2 Spaces | Configured in VS Code |
But wrapping a table in this sort of <div> removes the header, in conjunction with CSS in astro-test.css:
| 1 | Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc vel massa tincidunt, aliquam elit id, venenatis tellus. |
| 2 | Nunc molestie mauris et magna placerat tempus. |
| 3 | Phasellus sodales dolor enim, vel eleifend ante facilisis semper. |
| 4 | Integer vel dictum orci. |
| 5 | Praesent cursus ligula vel nisi rutrum, sit amet mollis tortor euismod. |
| 42 | Duis sollicitudin elit sit amet quam dictum congue. |
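One way to produce such a table is to emit an empty header row and let the CSS hide it. The sketch below assumes a class name of `no-header` and a matching rule along the lines of `.no-header thead { display: none; }`; neither is taken from astro-test.css, whose actual contents are not shown here.

```python
def headerless_table(rows, css_class="no-header"):
    """Wrap a Markdown table with an empty header row in a <div>.

    css_class is a hypothetical name; the stylesheet is assumed to hide
    the <thead> of tables inside that div.
    """
    width = len(rows[0])
    header = "|" + " |" * width        # empty header cells, e.g. "| | |"
    divider = "|" + "---|" * width     # delimiter row, e.g. "|---|---|"
    body = ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    # Blank lines after the opening tag and before the closing tag, so that
    # a CommonMark parser still treats the table as Markdown.
    return "\n".join(
        [f'<div class="{css_class}">', "", header, divider, *body, "", "</div>"]
    )


if __name__ == "__main__":
    print(headerless_table([[1, "Lorem ipsum"], [2, "Nunc molestie"]]))
```

The table still has a header as far as Markdown is concerned; it is only hidden at display time.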
IMAGE-PLACEHOLDER
Laus Veneris, by Edward Burne-Jones
See astro-test Textflow for more images with Fancybox effect.
This is my sort of paragraph which is proving so difficult to convert from .doc(x):
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc vel massa tincidunt, aliquam elit id, venenatis tellus.
Nunc molestie mauris et magna placerat tempus.
Phasellus sodales dolor enim, vel eleifend ante facilisis semper.
Integer vel dictum orci.
Praesent cursus ligula vel nisi rutrum, sit amet mollis tortor euismod.
Duis sollicitudin elit sit amet quam dictum congue.
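On the Markdown side, a paragraph like this needs hard line breaks, which CommonMark writes as a backslash at the end of each line. The following is a minimal sketch of that target form only; the function name is made up for illustration, and it is not a description of what the conversion scripts do internally.

```python
def join_with_hard_breaks(lines):
    """Join lines into one Markdown paragraph with hard line breaks.

    Hypothetical helper: each non-empty line is followed by a backslash
    and a newline, which CommonMark renders as a line break inside a
    single paragraph.
    """
    kept = [line.strip() for line in lines if line.strip()]
    return "\\\n".join(kept)


if __name__ == "__main__":
    print(join_with_hard_breaks([
        "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
        "Nunc vel massa tincidunt, aliquam elit id, venenatis tellus.",
        "Nunc molestie mauris et magna placerat tempus.",
    ]))
```

The last line carries no backslash, so the paragraph ends cleanly.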
- % cd file-folder # the folder holding the file to be converted
- % doc2md file-name.doc(x)
This creates file-folder/file-name.md
Intermediate files are created in file-folder/doc2md_intermediate. The folder may be deleted when the process is complete.
Copy the MD file to the appropriate Astro folder
~/winxp/projects/linux/bin/doc2md
Bash script controlling whole process
- Convert .doc file into intermediate .docx with LibreOffice, OR copy .docx to intermediate folder.
- Convert <Enter> for newlines in paragraphs to <Shift/Enter> with fix_word_01.py, and save document title in a file for later.
- Bracket 2-column sections with special strings for later with fix_word_02.py.
- Convert to MD with pandoc.
- Fix pandoc errors, convert section brackets to HTML, pick up document title and write frontmatter with fix_pandoc.py.
- Move .md file back to file-folder.
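The steps above can be sketched as a list of commands. This is only an outline under stated assumptions: the real controller is a Bash script, the arguments passed to the fix_*.py scripts are guesses, and `gfm` as the pandoc output format is an assumption (the actual options are not given here). The sketch only builds the commands; it does not run them.

```python
from pathlib import Path


def plan_commands(source, workdir="doc2md_intermediate"):
    """Return the commands for one conversion, in order, as argument lists.

    Assumptions: script arguments and the pandoc output format are
    illustrative; only the order of the steps comes from the list above.
    """
    src = Path(source)
    work = src.parent / workdir
    docx = work / (src.stem + ".docx")
    md = work / (src.stem + ".md")

    if src.suffix.lower() == ".doc":
        # LibreOffice headless conversion of .doc to .docx
        first = ["soffice", "--headless", "--convert-to", "docx",
                 "--outdir", str(work), str(src)]
    else:
        # Already .docx: just copy it into the intermediate folder
        first = ["cp", str(src), str(docx)]

    return [
        ["mkdir", "-p", str(work)],
        first,
        ["python3", "fix_word_01.py", str(docx)],   # newlines and title (args assumed)
        ["python3", "fix_word_02.py", str(docx)],   # bracket 2-column sections (args assumed)
        ["pandoc", "-f", "docx", "-t", "gfm", "-o", str(md), str(docx)],
        ["python3", "fix_pandoc.py", str(md)],      # fixes and frontmatter (args assumed)
        ["mv", str(md), str(src.with_suffix(".md"))],
    ]


if __name__ == "__main__":
    for cmd in plan_commands("file-name.doc"):
        print(" ".join(cmd))
```

Each entry could be passed to `subprocess.run(cmd, check=True)` so that a failing step stops the run before the next one starts.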