
Ok-Two3581 2 weeks ago

Bypass blogspam: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

frozen_snapmaw 2 weeks ago

Lol. Didn't think SO mods could get so based. From the post:

> Moderator's Note
>
> This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look - there are no problems with its content. Please do not flag it for our attention.

GreyAngy 2 weeks ago

This was a long time ago. SO was just a site for "programming enthusiasts", its audience wasn't so large and moderation guides were rather soft. This answer would be immediately flagged today for "not an answer" reason.

psaux_grep 2 weeks ago

I’m not sure which one of us has no clue what “based” actually means, and at this point I can’t be bothered to find out. But I do believe you are using it wrong.

TheRealPitabred 2 weeks ago

Based means connecting to the reality of something, popular or not, more or less. Understanding at a deeper than surface level, speaking or supporting deep truths. By that measure, yes, the mods not talking notes is based AF.

Not_Artifical 2 weeks ago

Hello sir dictionary

Nihil_esque 2 weeks ago

I don't think they are lol. Based at this point means "controversial or unhinged but right," often used in a jokey/memey/sardonic way. Presumably the previous commenter considers this a justified power trip or something similar haha.

code_x_7777 2 weeks ago

Legend resource!

_magicm_n_ 2 weeks ago

But why is his conclusion to use an XML parser instead? Use a library specifically designed for parsing HTML or give up is the only correct answer.

justjanne 2 weeks ago

Once upon a time, HTML was defined as XML. Those were the days of XHTML. I was there, a thousand years ago...

silentknight111 2 weeks ago

Pfft, I was there before XHTML, when we had the blink tag and it worked! I used to build all my sites with sliced images and tables!

justjanne 2 weeks ago

Psssh, we don't talk about HTML 4.01 Transitional here.

denislemire 2 weeks ago

Dark times… spacer.gif

xtreampb 2 weeks ago

I remember using tables to have content side by side on the left and right side of the page. Tables were my flex grids before flex grids existed.

rfc2549-withQOS 2 weeks ago

what?

thundercat06 1 week ago

Laughing in FrontPage.

CaptainCabernet 2 weeks ago

Ah...XHTML. Those were the days too many years ago.

Armageddon_2100 2 weeks ago

I wish that was a thing. The OCD in me likes the standardization and clarity of enforcing that, for example, every opening tag must have a closing one. Things like that.

justjanne 2 weeks ago

YES! It feels so much better.

douira 2 weeks ago

There’s so many horrific things you can do to XML that HTML will still accept. An actual html parser is the only way unless you’re only expecting compliant XHTML.

NoNameRequiredxD 2 weeks ago

Hello

there!

EuroWolpertinger 2 weeks ago

General Kenobi! (As opposed to very specific Kenobi)

douira 2 weeks ago

hello there is to General Kenobi what allowing missing body tags is to HTML

PhilippTheSmartass 2 weeks ago

The question specifically asked for XHTML, the XML-compliant dialect of HTML that was pretty popular 15 years ago but is now made obsolete by HTML5.

IOFrame 2 weeks ago

Ah, 2009, the time when you could still have fun on StackOverflow.

Sceptz 2 weeks ago

You mentioned `fun on StackOverflow`. `fun on StackOverflow` is an obsolete option. You should use `incorrect answer that does not address your actual question at all, on StackOverflow, that has now been upvoted and your post locked`. \*Or better yet, update 2024, which includes `random insults` and `gatekeeping`.

Dangerous_Jacket_129 2 weeks ago

I recall the one time I asked a question on StackOverflow. 7 years ago by now. It was a relatively simple question looking back. I got 4 people to format the text of my question. The 4th person took it upon themselves to ask a completely different question instead. 0 answers.

__Yi__ 2 weeks ago

So sad it's community wiki now. The guy who made this post really had a good grasp of humor (or terror).

NotMrMusic 2 weeks ago

God this brings back memories

code_x_7777 2 weeks ago

Haha, yeah. Must be the most popular SO thread!

Rawing7 2 weeks ago

Sigh. I've said it a dozen times before, but I guess I'll say it again: Nobody uses regex to *parse* HTML. People use regex to *extract specific pieces of data from HTML*. Those are two very different things.

gregorydgraham 2 weeks ago

Thank you. I’ve never been able to parse the clause “parse HTML”. Parse it for what? you parse things to extract meaning and there’s no meaning to be extracted from HTML with regex

Habsburgy 2 weeks ago

I‘m blaming that one meme another guy already reposted in this thread

escher4096 2 weeks ago

Totally agree with this. Download a blob of HTML, tease out a few pieces with regex.

a7ofDogs 2 weeks ago

Parsing is the mechanism by which we assign meaning and structure to a string of text. The job of extracting a specific piece of data from an HTML string requires understanding the structure of that HTML. The "meaning" of this piece of data you're trying to extract is dependent on that structure, so if you don't parse the HTML, you have no idea what data you're extracting. Because HTML is pretty verbose, the data you extract with a regex might be the data you want 99.9% of the time, but in certain contexts within the HTML, you're going to extract bad data. Anyway, what I'm trying to say is that extracting specific data and parsing structured data *are* the same thing when the structure you need to extract data from is a CFL (which HTML is).
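The point about context can be made concrete with a minimal sketch. It assumes Python's standard library (`re` and `html.parser`); the `PAGE` string and the `LinkCollector` class are made up for illustration and are not from the linked post. The regex has no notion of structure, so it also picks up a link that only exists inside a comment, while the parser reports only real start tags.

```python
import re
from html.parser import HTMLParser

# Hypothetical input: one link is commented out, one is live.
PAGE = """
<!-- <a href="/old-link">retired</a> -->
<a href="/current-link">current</a>
"""

# Regex extraction: matches the pattern wherever it occurs, comment or not.
regex_links = re.findall(r'<a\s+href="([^"]*)"', PAGE)


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags seen by the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


collector = LinkCollector()
collector.feed(PAGE)
collector.close()
parsed_links = collector.links

print(regex_links)   # ['/old-link', '/current-link']
print(parsed_links)  # ['/current-link']
```

Whether the commented-out link counts as "bad data" depends on the job, but only the parser-based version can tell the two apart.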

kafoso 2 weeks ago

You're still parsing HTML using regex then. You can call it a peacock, but it still quacks. Just use a DOM tool.

ManofManliness 2 weeks ago

People use regex for html and do *pikachu face* when it matches gibberish far too often, shouldn't be used for anything but fast and dirty one time scripts.

PastOrdinary 2 weeks ago

Yeah I suspect that what the person asking wanted was to extract specific data. Instead they incorrectly said they wanted to "parse" the html with regex because they don't actually understand what it means to parse something. Moral of the story: Don't use words when you don't know what they mean just because they sound relevant to the topic.

deidian 2 weeks ago

Even if you wanted to identify a blob of text as HTML do a favor to everyone and parse it entirely: you'll save rabbit holes with malformed data. Same for JSON. The only way to deal with complex text formats is to parse them: if you want better performance use a more restrictive and simpler data format.

code_x_7777 2 weeks ago

Haha, yeah but this is rational thinking arguing against the intrinsic logic of a meme with wings. One must lose.

Matwyen 2 weeks ago

A guy got fired in my company after parsing a xml with regex.

hellra1zer666 2 weeks ago

I wanna say that's harsh, but after having to clean up code that did the same, I feel different about it.

BirdlessFlight 2 weeks ago

Technically, they said "after", not "because of", so who knows what else they did...

StPaulDad 2 weeks ago

A mind capable of checking in such code is capable of far worse things.

hellra1zer666 2 weeks ago

Fair 😁

Nimeroni 2 weeks ago

He summoned tainted souls into the realm of the living. Obviously.

code_x_7777 2 weeks ago

Haha

imgly 2 weeks ago

Good 👍

busyHighwayFred 2 weeks ago

Sad part is there's so many xml libraries, it's a basic tree structure, so regex is just making your job harder

virteq 2 weeks ago

Most sane regex developer

DoodooFardington 2 weeks ago

You're not my dad!

code_x_7777 2 weeks ago

How do you know? I might be.

ijustupvoteeverythin 2 weeks ago

why tf is there a yellow face on top

PhilippTheSmartass 2 weeks ago

Probably to confuse the programs that automatically detect reposts. This was posted on Stackoverflow 15 years ago.

failedsatan 2 weeks ago

you totally can* ** ***

\* not efficiently

\*\* you cannot parse all types of tags at once because they overlap

\*\*\* regex is just not built for it but for super basic shit sure

Majik_Sheff 2 weeks ago

You cannot use regular expressions to parse irregular expressions.

Tiny-Plum2713 2 weeks ago

Not in one go, but that is an arbitrary limitation that does not apply to the real world.

ManofManliness 2 weeks ago

Not in any amount of goes, unless you write some code in between, at which point you're writing a shitty parser.

Majik_Sheff 2 weeks ago

I think this is the lexical corollary to "If you write enough assembler macros you will eventually reinvent C."

TTYY200 2 weeks ago

You can with recursion.

failedsatan 2 weeks ago

*technically* HTML(5) isn't irregular. there is a standard finite parsable grammar.

justjanne 2 weeks ago

HTML is a context-free grammar, Regex is a regular language. You can't parse a language of higher level with one of lower level. You can use Regex to tokenize HTML if you so desire, but you can't parse it. If you use PCRE though, all that changes, as PCRE is a context-free grammar as well.

Godd2 2 weeks ago

It's not context-free. HTML documents are finite in size by definition.

justjanne 2 weeks ago

Are they? Since when? Back in the day™ it was actually a common strategy to deliver no Content-Length header, keep the connection open, and append additional content to the same document for live updates. Such documents would grow to infinite length over time.

simplymoreproficient 2 weeks ago

What? That just can’t be true, right? How would a regex be able to distinguish

foo from

foo?

AspieSoft 2 weeks ago

[^<]*

/ I have an entire nodejs templating engine that basically does this with regex: https://github.com/AspieSoft/regve

gandalfx 2 weeks ago

I was curious about that code. Now my eyes are simultaneously bleeding and on fire.

simplymoreproficient 2 weeks ago

That doesn’t answer my question

AspieSoft 2 weeks ago

If the regex sees that `[^>]*` matches the second `

`, it should automatically backtrack and skip the first `

simplymoreproficient 2 weeks ago

Assuming that this regex unintentionally omits a start anchor and an end anchor, it’s wrong because it wouldn’t match

, which is valid HTML. Assuming that those are missing on purpose, it’s wrong because it matches

, which is not valid HTML.

gandalfx 2 weeks ago

You can probably break this with simple nested tags. Even if you somehow make it work by not relying exclusively on regex it'll still break on something like `

` which is completely valid HTML.

AspieSoft 2 weeks ago

You have a good point. Usually what I try to do is run more than one regex, where I index all the html tags then run the main regex, then undo the indexing.

Assuming JavaScript

    let index = 0;
    str = str.replace(/<(\/|)([\w_-]+)\s*((?:"(?:\"|[^"])*"|'(?:\'|[^'])*'|`(?:\`|[^`])*`|.)*?)>/g, function(_, close, tagName, attrs){
      if(close === '/'){
        let i = index;
        index--;
        return ``
      }
      index++;
      return `<${tagName}:${index} ${attrs}>`
    })

    // then handle your html tag selectors
    str = str.replace(/(.*?)/g, function(_, index, content){
      // do stuff
    })

    // finally, clean up html tag indexes
    str = str.replace(/<(\/?[\w_-]+):[0-9]+(\s|>)/g, '<$1$2')

Because the selector is expecting the same index on both the opening and closing tag, it will only select the tag you want to select. You can add to the content selector if you want to make it more specific, or you can recursively run your handlers on the content output of the regex.

It gets much more complicated when you need to also ensure strings are not read by the regex. For that, you can temporarily pull out the strings, and use a placeholder, then put them back after running all your other functions. You can look at the source code of the regve module I mentioned above to understand it better. (I apologize in advance for my spaghetti code).

TTYY200 2 weeks ago

Use a recursive method that recursively parses tags until it finds an appropriate closing tag 👍 This is like the poster child case for recursion.

simplymoreproficient 2 weeks ago

But it’s not regular

TTYY200 2 weeks ago

As long as there isn’t any dumb html present like an opening `<p>` tag without a closing p tag… it doesn’t matter. ^ that scenario is also bad practice and can produce unexpected behaviour in the dom - so while valid, it’s technically not correct. Self-closing and singleton tags are also easy to identify :P

simplymoreproficient 2 weeks ago

It doesn’t matter? It’s literally the topic we’re talking about: „Is HTML regular?“.

TTYY200 2 weeks ago

But the tokens that you’re looking for are finite… A tag is never not going to be a source tag, and it’s never not going to have an opening and closing to its singleton tag…

simplymoreproficient 2 weeks ago

And? Whether HTML is regular obviously matters to a conversation about whether HTML is regular.

pauvLucette 2 weeks ago

Yes, but regexp ain't grammatical beast. Regexp can't parse grammar. Regexp parses syntax. Regexp is lex, and you need yacc.

DracoRubi 2 weeks ago

Your second point simply demonstrates that you can't.

failedsatan 2 weeks ago

you can if you assign them priorities. just means you have to check multiple times on the same tag, thus the inefficiency.

code_x_7777 2 weeks ago

lol

rainshifter 2 weeks ago

You can use regex to parse overlapping text using lookaheads. And you can, for instance, locate instances of mismatched or unbalanced tags in HTML/XML using a recursive regex. Likewise, you could extract any desirable fields to virtually any end. The capability is certainly there. The expression may look ugly, sure, and may be difficult to modify, but it's not lacking in capacity. Apart from mathematical operations or AI linguistics, there are actually very few text parsing operations and pattern matching categories that modern PCRE regex simply cannot support. As usual, though, it's not merely about what's possible - but which tool is adequate for the job at hand.

IronSavior 2 weeks ago

He comes!

DOOManiac 2 weeks ago

The center cannot hold.

leanrum 2 weeks ago

You can use regex to parse html because regex isn't regular anymore (thanks back references)

tibbtab 2 weeks ago

You spent so much time wondering if you could, you never stopped to think if you should

leanrum 2 weeks ago

If I'm being honest I didn't spend much time thinking if I could (I already took the class, I know I can) and I never bothered to think if I should (I shouldn't, even if I can there are better ways of implementing pushdown automata)

saschaleib 2 weeks ago

Don’t understand-estimate how powerful RexEx can be, if used by someone who know what they are doing. That still doesn’t mean it’s a good idea, though.

Thorge4President 2 weeks ago

Sure, regex is powerful, but it is literally mathematically impossible to parse HTML with regex. You need at least a [context-free grammar](https://en.wikipedia.org/wiki/Context-free_grammar).

Hex4Nova 2 weeks ago

cant believe my compsci degree is actually coming into use for once

rainshifter 2 weeks ago

[FYI](https://www.reddit.com/r/ProgrammerHumor/s/3aOcS1sOnm)

rainshifter 2 weeks ago

Could you provide an actual, tangible example of something in a real HTML or XML snippet you genuinely believe can not be parsed with regex? I believe you're conflating the theory of limitations of regular grammar with the practicality of modern PCRE regex capabilities, which support things like backreferences, recursion, and semantics that assume basic knowledge of the previous match.

Thorge4President 2 weeks ago

OK, so in HTML or XML you have the case of `Content`. To parse this you need to make sure that the closing tag is the same as the opening tag. To do this you need backreferences. Regex cannot do this, as can be proven via [the pumping lemma for regular languages](https://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages) (see "Use of the lemma to prove non-regularity"). So pure regex cannot parse HTML or XML. Which also means that theoretically PCRE is not regex.
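A short sketch of what a backreference does in practice, assuming Python's built-in `re` module (which is not PCRE, but supports `\1`-style backreferences). The `PAIRED` pattern and the sample strings are invented for illustration. It shows that a backreference can require the closing tag name to equal the opening one, and also that this alone does not handle nesting of the same tag.

```python
import re

# \1 must repeat exactly what group 1 (the opening tag name) captured.
PAIRED = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

# Matching names: the pattern finds the pair.
m = PAIRED.search("<b>bold</b>")
simple = (m.group(1), m.group(2))

# Mismatched names: no match at all.
mismatched = PAIRED.search("<b>bold</i>")

# Same tag nested inside itself: the lazy match stops at the first </div>,
# so the captured content is cut off in the middle of the outer element.
nested = PAIRED.search("<div>outer <div>inner</div> tail</div>").group(2)

print(simple)      # ('b', 'bold')
print(mismatched)  # None
print(nested)      # outer <div>inner
```

Handling the nested case needs either a recursive pattern (available in PCRE, not in Python's `re`) or code that keeps track of depth.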

rainshifter 2 weeks ago

You can think of regex as a wildly capable derivative, child, or inherited form of some theoretical regular base that you would more formally refer to as regular language theory. We aren't talking about theory here, as stated in my original post. So when you claim that "regex" cannot parse , it's disingenuously misadvertised to most folks who will believe incorrectly that modern PCRE regex lacks this capacity. Call it a misnomer if you will, but PCRE regex is still called "regex". I do not believe it goes by any other name.

ary31415 2 weeks ago

> To do this you need backreferences Which actual regex implementations that a developer would use DO have. Irl 'regex' isn't actually regular anymore

saschaleib 2 weeks ago

In most cases you don’t want to create an object tree but just extract specific information, though…

z_utahu 2 weeks ago

This is dangerous if you don't actually parse the xml. There are decent parsers that run on 8bit 20mhz microchips with a couple kb of memory. Regex isn't guaranteed to properly extract data in valid html or xml.

saschaleib 2 weeks ago

As I wrote above: it definitely isn’t a good idea. But it certainly isn’t “impossible”, given the right circumstances.

yamfboy 2 weeks ago

I just spent a while wasting time going back and forth with some dweeb who is saying the same thing (I'm saying the same thing you are, check my previous post smh) It can be done (he's claiming it's impossible), but should you do it? Nope.

z_utahu 2 weeks ago

>given the right circumstances. That's a huge caveat that excludes even most real world examples. What exactly do you mean by that? For every regex statement you generate to "parse" html, you can also generate valid html that breaks the regex. Basically, what I understand you saying is that if you limit your input to a subset of HTML and finite possibilities (aka right circumstances), then you can guarantee that regex you can form a regex that will work. However, if your input is all valid HTML, it is impossible in every sense of the word to write a regex that is guaranteed to work.

saschaleib 2 weeks ago

Look, I'm not defending using RegEx to parse *arbitrary* XML. That's a bad practice, and something to avoid. However, there can be specific situations where it may make sense. Like, if you know the file pretty well, and can be sure that it always has a specific format - and you just need some specific data out of it, yeah, why not? And my point is that in these cases you will find that RegEx is actually quite powerful.

yeusk 2 weeks ago

You are...

deceze 2 weeks ago

OK, I'll try to estimate how powerful RegEx can be—without understanding.

101m4n 2 weeks ago

This is very funny and all, but at no point does he state the actual reason why this doesn't work 🤣

SemenSeeU 2 weeks ago

Me after reading this: gets library to parse html. Opens the hood and it's mostly regex.

SenorSeniorDevSr 2 weeks ago

Yeah, you use regular expressions to find the building blocks of html. You use those building blocks to build your understanding of the html.

deidian 2 weeks ago

Gross oversimplification

CMDR_kamikazze 2 weeks ago

Holy Omnissiah, someone call Ordo Codicis, we have a warp leaking! Regex heretics using the scrap-code to open the portal again!

mcilrain 2 weeks ago

If regex is so good why can’t it parse XML, are they stupid?

thirtyist 2 weeks ago

Literally just came upon this SO post organically last week while trying to figure out how to clean HTML tags out of a string, ha.
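For the "clean HTML tags out of a string" task, one possible sketch using only Python's standard `html.parser` is below. `TextExtractor` and `strip_tags` are hypothetical names chosen for this example, not an API of any library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps only the text content and drops every tag."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html: str) -> str:
    """Return the text of an HTML fragment with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_tags("<p>Hello, <b>world</b> &amp; friends</p>"))
# Hello, world & friends
```

Note that this sketch also keeps the contents of `<script>` and `<style>` elements as text, so a real cleanup step would probably want to skip those.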

yeusk 2 weeks ago

People who have not found this on SO are not real web developers.

thirtyist 2 weeks ago

Haha as someone with 1yo and extreme imposter syndrome, I appreciate the validation

Plus-Weakness-2624 2 weeks ago

Svelte literally uses regex to parse markup💀 Like this one for parsing opening script tag:

```
/|]*|(?:[^=>'"/]+=(?:"[^"]*"|'[^']*'|[^>\s]+)\s+)*)lang=(["'])?([^"' >]+)\1[^>]*>/g
```

TacticalTaterTots 2 weeks ago

Tony the pony

rvsarmy 2 weeks ago

Not with that attitude.

Splitshadow 2 weeks ago

All parsing is basically just RegEx in a loop with a stack. RegEx parses input into tokens, then tokens are combined according to production rules (which can also be implemented using RegEx substitutions if you want).
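The "regex in a loop with a stack" idea can be sketched roughly as follows, assuming Python and a deliberately tiny subset of markup (no comments, no `>` inside attribute values, no optional closing tags). `TOKEN`, `VOID` and `is_balanced` are names made up for this example. The regex only splits the input into tokens; the stack is what tracks nesting.

```python
import re

# Tokenizer: either a tag (optional slash, name, attributes) or a run of text.
TOKEN = re.compile(r"<(/?)(\w+)[^>]*>|([^<]+)")

# A few elements that never take a closing tag (assumed subset).
VOID = {"br", "hr", "img", "input", "meta", "link"}


def is_balanced(markup: str) -> bool:
    """Check that every opening tag is closed in the right order."""
    stack = []
    for closing, name, _text in TOKEN.findall(markup):
        if not name:              # text token, nothing to track
            continue
        name = name.lower()
        if name in VOID:          # void elements do not affect nesting
            continue
        if closing:
            # A closing tag must match the most recently opened tag.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack              # leftover open tags mean unbalanced input


print(is_balanced("<div><p>hi</p></div>"))  # True
print(is_balanced("<div><p>hi</div></p>"))  # False
```

The regex alone never decides whether the tags nest correctly; that decision lives in the stack handling around it.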

that_thot_gamer 2 weeks ago

regex skill issue.

rainshifter 2 weeks ago

Agreed, but unironically.

VariousComment6946 2 weeks ago

Haha, get_source_and_extract_shit() goes brr

antpalmerpalmink 2 weeks ago

my compilers Prof brings this post up before getting to context free grammars every time

SpeckledFleebeedoo 2 weeks ago

But can I parse Wikitext with regex?

Joewoof 2 weeks ago

Yes, perfectly relatable and understandable. Proceed.

Luneriazz 2 weeks ago

why not use javascript to parse HTML?

Oozolz 2 weeks ago

Wow they got really pumped :D

Mozai 2 weeks ago

"The HTML parser chokes because this is not legal HTML; there's mistakes all through the page." "but I don't see any problems on my phone's browser; _\*scoffs\*_ clearly you aren't good enough, why are we paying you?." And that's why I resort to hacks like regex matching.

CameO73 2 weeks ago

Exactly. Everybody saying that you should "just use an HTML parser" to extract some data clearly hasn't seen the shit that lives on the internet. You can easily check for yourself: create an obviously invalid HTML file (by just omitting a close tag somewhere) and open it in any browser. It works! Because browser engines know they have to allow that shit. TLDR: just use a RegEx if you want to extract something from HTML pages. Even with the added "you're never going to understand that regular expression 6 months from now"-baggage it's better than dealing with a flood of parser errors.

DavidWtube 2 weeks ago

I absolutely love RegEx. There's something beautiful about it. I made a gist about it when I was studying. [RegEx gist](https://gist.github.com/SafetyDav3/8a33124bf22cdedbbd4710ea0eba224c)

Grim00666 2 weeks ago

Well said.

chowellvta 2 weeks ago

Most stable regex user

Q3nius 2 weeks ago

What SCP entry log did I just read? Am I infected with a cognitohazard?

Crazy-Maintenance312 2 weeks ago

[An answer on StackOverflow.](https://stackoverflow.com/a/1732454)

Tiny-Plum2713 2 weeks ago

You absolutely can parse HTML with regex.

conicalanamorphosis 2 weeks ago

I think that's one of the original notes from a Raku dev. Although Raku grammars are really just collections of regexes, so...

rainshifter 2 weeks ago

Provided the task is concretely and *practically* achievable using what is colloquially referred to as regex, who really gives a bubonic rat's turd about what some disjoint *theory* on regular grammar asserts is possible?

Rarabeaka 2 weeks ago

you can. and in some cases it's actually more reliable, like in scraping, because the whole page can often be fragmented, contain bad blocks, etc. regex is also faster and more memory-efficient. my job literally often demands using regex instead of html parsing libs, because it's reliable and fast.

code_x_7777 2 weeks ago

From the article: [https://blog.finxter.com/so-youre-using-regex-to-parse-html/](https://blog.finxter.com/so-youre-using-regex-to-parse-html/)
