Converted from doc(x) by doc2md
Posted on 22nd May 2026 at 02:06 by Admin
This is the source page test_docx.docx which the conversion process is supposed to convert into test-docx.md
If possible, MD will be preferable to MDX. MDX throws syntax errors on all sorts of innocuous-looking strings.
We need to be able to convert .doc and .docx files from my documentation store to .md(x) for Astro content creation.
We need a utility that will do the conversion automatically file by file. I don’t think we’ll ever need a block converter.
Various schemes have been tried:
- pandoc. Very popular. Can’t cope with my paragraph style. Might be worth experimenting with md format options.
- Mammoth. Doesn’t seem to be configurable. Can’t cope with my paragraph style.
Contents
- My sort of paragraph. See Paragraphs
- Internal anchors and links
- Tables
- Images
- Columns. This is hardest. Relies on columns being lists of hyperlinks. What happens to hyperlinks in ordinary paragraphs?
Markdown tables have to have a header by default:
| Feature | Status | Notes |
|---|---|---|
| Astro Layouts | Working | Using @layouts alias |
| Styles | Scoped | Testing specificity |
| Indentation | 2 Spaces | Configured in VS Code |
Blank headers are removed with CSS in astro-test.css:
| 1 | Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc vel massa tincidunt, aliquam elit id, venenatis tellus. |
| 2 | Nunc molestie mauris et magna placerat tempus. |
| 3 | Phasellus sodales dolor enim, vel eleifend ante facilisis semper. |
| 4 | Integer vel dictum orci. |
| 5 | Praesent cursus ligula vel nisi rutrum, sit amet mollis tortor euismod. |
| 42 | Duis sollicitudin elit sit amet quam dictum congue. |
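
A minimal sketch of how a headerless table could be emitted with the blank header row that Markdown requires, leaving the CSS to hide it. The function name `table_with_blank_header` is hypothetical and is not part of the conversion scripts; it also assumes cells contain no `|` characters.

```python
def table_with_blank_header(rows):
    """Render rows as a Markdown pipe table whose header cells are all empty."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so every line has the same number of cells.
    padded = [list(row) + [""] * (width - len(row)) for row in rows]
    header = "|" + " |" * width      # e.g. "| | |" for two columns
    divider = "|" + "---|" * width   # e.g. "|---|---|"
    # Assumption: cells contain no "|" characters, so no escaping is done.
    body = ["| " + " | ".join(str(cell) for cell in row) + " |" for row in padded]
    return "\n".join([header, divider] + body)


if __name__ == "__main__":
    print(table_with_blank_header([["1", "Lorem ipsum"], ["2", "Nunc molestie"]]))
```

The empty header row keeps the table valid Markdown, so the stylesheet only has to hide it.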
IMAGE-PLACEHOLDER
Laus Veneris, by Edward Burne Jones
This is an embedded image. All such images will require manual intervention after conversion, so fix_word_01.py replaces images with ‘IMAGE-PLACEHOLDER’ text.
See astro-test Textflow for more images with Fancybox effect.
External images should be no problem ![]()
Paragraph with hyperlinks that could cause trouble
Try putting hyperlinks in various places to see if they cause trouble.
The first one is on its own:
ncvp home page at the start of a line
At the end of a line ncvp home page
In the middle of a line ncvp home page with text following
What about other lines in the same paragraph.
These don’t have links.
fix_word_01.py doesn’t fix any <Enter> terminated line which includes a link. This really isn’t a problem.
Section with CTRL/Enter newlines
Maecenas id sapien risus. Nullam varius id tellus quis volutpat. In a tempor tortor, cursus sodales enim. Integer sed felis et massa lacinia molestie volutpat quis dui.
Duis interdum libero sed velit ornare, at blandit nulla volutpat. Mauris quis mi arcu. Praesent vulputate accumsan tortor, et ullamcorper sapien tincidunt ut. Nulla a eleifend justo.
Nam facilisis efficitur risus, vitae ornare ex fringilla vitae. Phasellus neque justo, vehicula sed finibus non, posuere sed ligula.
Maecenas id sapien risus. Nullam varius id tellus quis volutpat. In a tempor tortor, cursus sodales enim. Integer sed felis et massa lacinia molestie volutpat quis dui.
Duis interdum libero sed velit ornare, at blandit nulla volutpat. Mauris quis mi arcu. Praesent vulputate accumsan tortor, et ullamcorper sapien tincidunt ut.
Nulla a eleifend justo.
Nam facilisis efficitur risus, vitae ornare ex fringilla vitae. Phasellus neque justo, vehicula sed finibus non, posuere sed ligula.
MD vs MDX vs Astro pages and posts
Gemini says:
Standard .md is often preferable for simple, text-heavy blog posts, but .mdx and .astro offer superior flexibility depending on your specific needs.
Choosing the right format depends entirely on the level of interactivity you want to achieve.
1. Markdown (.md)
Best for standard, text-heavy articles (blogs, documentation, and guides).
- Pros: Standardized, highly portable, and offers the fastest build times. Astro processes it natively without requiring extra configurations.
- Cons: You cannot embed UI components (like a custom interactive button, a React chart, or a newsletter signup) or execute JavaScript directly inside the post.
- Use this if: Your content is purely text and images, and you want to keep your project lightweight.
2. MDX (.mdx)
Best for content that requires interactive elements or custom UI components.
- Pros: Allows you to use Astro’s native component integrations. You can import and embed React, Svelte, or Vue components directly inline alongside your text.
- Cons: Requires installing the official Astro MDX Integration. It is marginally slower to build and can sometimes cause formatting/linting quirks with tooling like Prettier.
- Use this if: You need to embed custom components, dynamic data, or dynamic code snippets inside your articles.
3. Astro Pages (.astro)
Best for structural site pages (Home, About, Contact, and Tag Archives) or highly customized template layouts.
- Pros: Gives you full control over HTML, CSS, JavaScript, and layout imports. You can easily pull in data from Astro Content Collections dynamically to list out your blog posts.
- Cons: Takes much longer to write than standard Markdown if you are just drafting an article, as you have to manually code paragraphs, headings, and styling.
- Use this if: You are building layout wrappers, index pages listing your posts, or interactive hub pages.
4. Summary Recommendation
Keep your actual pages/posts in .md if they are purely textual. Switch to .mdx only if you find yourself needing to inject custom interactive elements mid-post. Use .astro strictly for your site’s templates and structural pages.
- Some sort of bold text has been converted to **without space. Should be easy to fix in fix_pandoc.py. This highlights the fact that I need more sorts of headings in my doc files.
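
A possible shape for that fix, as a sketch only. It assumes the problem is a closing `**` run straight into the next word (for example `**Pros:**Standardized`); if the missing space is elsewhere, the pattern would need adjusting. The function name `fix_bold_spacing` is hypothetical and not taken from fix_pandoc.py.

```python
import re

# One bold span (**...** on a single line), plus the character that follows it
# when that character is a letter, digit or underscore.
_BOLD_THEN_WORD = re.compile(r"(\*\*[^\s*][^*\n]*\*\*)(\w?)")


def fix_bold_spacing(markdown):
    """Insert a space after a bold span that is glued to the following word."""
    def add_space(match):
        bold, following = match.group(1), match.group(2)
        # Only add a space when a word character directly follows the span.
        return f"{bold} {following}" if following else bold

    return _BOLD_THEN_WORD.sub(add_space, markdown)


if __name__ == "__main__":
    print(fix_bold_spacing("**Pros:**Standardized, highly portable"))
```

Matching every bold span in turn, rather than just searching for `**` followed by a letter, keeps the opening markers of later spans from being mistaken for closing ones.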
In practice, the only time I format a selection with columns is two columns of hyperlinks.
But what happens if I have two columns of general stuff?
Thing 1
Thing 2
Thing 3
Thing 4
Thing 5
Thing 6
Thing 7
Thing 8
Thing 9
That works. I’m not sure how.
Each stage of the conversion process generates a new -tempn.docx file. These are normally deleted at the end of the process, but may optionally be kept for debugging.
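
The temp-file handling could look roughly like this. It is a sketch under assumptions: each stage is modelled as a Python callable taking a source and a target path, and the `-final` output name and the `keep_temp` flag are hypothetical names chosen for illustration.

```python
from pathlib import Path
import shutil


def run_stages(source, stages, keep_temp=False):
    """Run stages in order; stage n writes <stem>-temp<n>.docx next to the source."""
    source = Path(source)
    current = source
    temp_files = []
    try:
        for n, stage in enumerate(stages, start=1):
            target = source.with_name(f"{source.stem}-temp{n}{source.suffix}")
            stage(current, target)   # each stage reads the previous file, writes a new one
            temp_files.append(target)
            current = target
        # Hypothetical output name; the real pipeline ends with a .md file instead.
        final = source.with_name(f"{source.stem}-final{source.suffix}")
        shutil.copyfile(current, final)
    finally:
        # Temp files are deleted even if a stage fails, unless kept for debugging.
        if not keep_temp:
            for path in temp_files:
                path.unlink(missing_ok=True)
    return final
```

Calling `run_stages("test_docx.docx", [stage_one, stage_two], keep_temp=True)` would leave `test_docx-temp1.docx` and `test_docx-temp2.docx` in place for inspection.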
fix_word_01.py
- Change <Enter> newlines within a paragraph to <Shift/Enter>. Don’t touch ordered or unordered lists, or the lists of links within two column sections.
- Replace any images with ‘IMAGE-PLACEHOLDER’. They’re going to need manual intervention.
- Extract the document title from the header and save it in a file for fix_pandoc.py to add to the frontmatter.
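
The receiving end of the title hand-off might look something like the sketch below, which prepends a frontmatter block to the converted Markdown. The function name `add_title_frontmatter` is hypothetical, and a single `title` field is an assumption; the real frontmatter may carry more.

```python
import json


def add_title_frontmatter(markdown, title):
    """Prepend a YAML frontmatter block containing only the document title."""
    # json.dumps gives a double-quoted string with quotes and backslashes
    # escaped, which is also a valid YAML double-quoted scalar.
    quoted = json.dumps(title, ensure_ascii=False)
    return f"---\ntitle: {quoted}\n---\n\n" + markdown.lstrip("\n")


if __name__ == "__main__":
    print(add_title_frontmatter("Body text\n", "Test page"))
```

Quoting the title means colons or quotation marks in a Word header cannot break the frontmatter.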
fix_word_02.py
Locate 2-column sections and bracket them with special text for fix_pandoc.py to replace with HTML.
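
A sketch of the replacement half of that hand-off, working on the Markdown that pandoc produces. The marker strings `COLUMNS-START` and `COLUMNS-END` and the `two-columns` CSS class are hypothetical; the actual special text used by fix_word_02.py is not recorded here.

```python
import re

START = "COLUMNS-START"  # hypothetical marker text, each on its own line
END = "COLUMNS-END"      # hypothetical marker text

_SECTION = re.compile(
    rf"^{re.escape(START)}[ \t]*\n(.*?)^{re.escape(END)}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def restore_columns(markdown, css_class="two-columns"):
    """Replace each marker-bracketed section with a <div> wrapper."""
    def wrap(match):
        inner = match.group(1).strip("\n")
        # Blank lines around the content so the Markdown inside is still parsed.
        return f'<div class="{css_class}">\n\n{inner}\n\n</div>'

    return _SECTION.sub(wrap, markdown)


if __name__ == "__main__":
    sample = "Before\n\nCOLUMNS-START\n- [a](x)\n- [b](y)\nCOLUMNS-END\n\nAfter\n"
    print(restore_columns(sample))
```

Plain upper-case marker text, in the same spirit as ‘IMAGE-PLACEHOLDER’, avoids characters that a converter might escape on the way through.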
fix_pandoc.py
Mainly the deletion of all the random sections which seem to appear, and the re-instatement of the 2-column sections