XHTML 1.0 Transitional parser?

posted 1 month ago

I’m trying to parse my school’s website for some info, pulling out a few values with XPath. I found an HTML5 parser, but it can’t properly parse the first line. Then I figured out the page is actually XHTML and not HTML. After a quick Google search I learned that XHTML can be parsed with any XML parser, so I found one and… it can’t parse the first line either. So I asked Llama 3.1 (like a real programmer) why no parser can get past the first line. It explained things so nicely that I did not destroy my keyboard when I was told the document is “XHTML 1.0 Transitional”, a mix of HTML 4 and XHTML that, according to it, can’t be parsed with either an HTML or an XML parser. I hate the guy that invented that so much…

So is there a crate that can parse XHTML 1.0 Transitional? Or a crate to convert XHTML into something else? Any advice?


calcopiritus@lemmy.world

2 points

1 month ago

HTML is hard to parse because it allows mistakes.

I don’t know the answer to your question. But if it were me, I’d run the HTML parser until it hits an error, fix that error by hand, and parse again, repeating until the whole document parses correctly.
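As a rough sketch of what one "fix it by hand" step could look like: I'm assuming here (the post doesn't say) that the line the parsers choke on is the XML declaration or the DOCTYPE at the top of the page. `strip_prolog` is a made-up helper name, it only uses the standard library, and it assumes the DOCTYPE has no internal subset. The idea is to cut those lines off and then hand the rest to whatever parser you already have.

```rust
/// Hypothetical helper: removes a leading BOM, XML declaration, and DOCTYPE
/// from `input` and returns the rest of the document unchanged.
fn strip_prolog(input: &str) -> &str {
    // Skip a byte-order mark and any leading whitespace.
    let mut rest = input.trim_start_matches('\u{feff}').trim_start();

    // Drop an XML declaration such as `<?xml version="1.0"?>`.
    if rest.starts_with("<?xml") {
        if let Some(end) = rest.find("?>") {
            rest = rest[end + 2..].trim_start();
        }
    }

    // Drop a DOCTYPE declaration. Assumption: no internal subset,
    // so the first `>` closes the declaration.
    let is_doctype = rest
        .get(..9)
        .map_or(false, |head| head.eq_ignore_ascii_case("<!DOCTYPE"));
    if is_doctype {
        if let Some(end) = rest.find('>') {
            rest = rest[end + 1..].trim_start();
        }
    }

    rest
}

fn main() {
    // Made-up sample page, not the actual school website.
    let page = r#"<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Hello</p></body></html>"#;

    let cleaned = strip_prolog(page);
    println!("{cleaned}");
}
```

If the parser then fails somewhere further down, the same approach applies: look at where it stopped, add another small cleanup step, and try again.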


Rust

!rust@programming.dev


Welcome to the Rust community! This is a place to discuss the Rust programming language.
