Extract Content Between HTML Tags Regex

Extract text content between matching HTML opening and closing tags using regex with lookbehind and lookahead assertions.

Pattern Breakdown

regex

(?<=<(\w+)>).*(?=<\/\1>)

Components

Component	Description	Purpose
`(?<=<(\w+)>)`	Positive lookbehind	Asserts opening tag before content
`<(\w+)>`	Opening tag	Captures tag name in group 1
`.*`	Content	Any characters except line breaks (greedy)
`(?=<\/\1>)`	Positive lookahead	Asserts matching closing tag after content
`<\/\1>`	Closing tag	Uses backreference `\1` to match tag name

Detailed Breakdown

(?<=<(\w+)>) - Positive lookbehind:
- (?<=...) - Lookbehind assertion (not included in match)
- <(\w+)> - Opening tag with captured tag name
.* - Matches any content (greedy, matches as much as possible)
(?=<\/\1>) - Positive lookahead:
- (?=...) - Lookahead assertion (not included in match)
- <\/\1> - Closing tag using backreference
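To see the pieces working together, here is a small Python sketch. Python's built-in re module rejects variable-width lookbehind, so the sketch uses the equivalent capturing-group form and reads the content from group 2 instead of the whole match. The name TAG_CONTENT is illustrative, not part of the pattern.

```python
import re

# Capturing-group equivalent of the lookaround pattern, annotated piece by piece.
TAG_CONTENT = re.compile(
    r"""
    <(\w+)>     # opening tag; the tag name is captured in group 1
    (.*?)       # content, captured in group 2 (lazy: stops at the first closing tag)
    </\1>       # closing tag; \1 must repeat the name captured in group 1
    """,
    re.VERBOSE,
)

match = TAG_CONTENT.search("<div>Hello World</div>")
if match:
    print(match.group(1))  # tag name: div
    print(match.group(2))  # content: Hello World
```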

Examples

Matches:

Input: <div>Hello World</div> → Matches: Hello World
Input: <p>This is a paragraph</p> → Matches: This is a paragraph
Input: <h1>Title</h1> → Matches: Title
Input: <div><span>nested</span></div> → Matches: <span>nested</span> (the inner markup, not just the text)

Doesn't Match:

<div>content</span> (mismatched tags)
<div>content (no closing tag)
content</div> (no opening tag)
<span class="test">Content here</span> (the opening tag has attributes, which <(\w+)> does not allow)
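The cases above can be checked mechanically. The following Python sketch runs the capturing-group form of the pattern over a few of the inputs and returns None where nothing matches. The helper name extract_content is hypothetical.

```python
import re

PATTERN = re.compile(r"<(\w+)>(.*?)</\1>")

def extract_content(html):
    """Return the content of the first simple tag pair, or None if none matches."""
    match = PATTERN.search(html)
    return match.group(2) if match else None

samples = [
    "<div>Hello World</div>",
    "<h1>Title</h1>",
    "<div>content</span>",   # mismatched tags
    "<div>content",          # no closing tag
    "content</div>",         # no opening tag
]
for sample in samples:
    print(repr(sample), "->", repr(extract_content(sample)))
```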

Implementation

JavaScript

javascript

// Capturing-group form: works in every JavaScript engine
const extractContentRegex = /<(\w+)>(.*?)<\/\1>/;
const match = '<div>Hello World</div>'.match(extractContentRegex);
if (match) {
  console.log(match[2]); // "Hello World"
}

// Lookaround form: requires lookbehind support (ES2018+).
// JavaScript lookbehind may be variable-length, so (\w+) is allowed here.
const extractRegex = /(?<=<(\w+)>).*(?=<\/\1>)/;
const result = '<div>Hello World</div>'.match(extractRegex);
if (result) {
  console.log(result[0]); // "Hello World"
}

Python

python

import re

# Python's re module only accepts fixed-width lookbehind, so compiling
# r'(?<=<(\w+)>).*(?=<\/\1>)' raises re.error.
# Use capturing groups and read the content from group 2 instead.
extract_regex = r'<(\w+)>(.*?)</\1>'
match = re.search(extract_regex, '<div>Hello World</div>')
if match:
    print(match.group(1))  # "div"
    print(match.group(2))  # "Hello World"

Go

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // Go's regexp package (RE2 syntax) supports neither lookarounds nor
    // backreferences, so capture both tag names and compare them in code.
    extractRegex := regexp.MustCompile(`<(\w+)>(.*?)</(\w+)>`)
    match := extractRegex.FindStringSubmatch("<div>Hello World</div>")
    if match != nil && match[1] == match[3] {
        fmt.Println(match[2]) // "Hello World"
    }
}

Limitations

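Lookbehind support: The lookaround pattern needs variable-length lookbehind. JavaScript (ES2018+) accepts it, Python's re module requires fixed-width lookbehind, and Go's regexp package has no lookarounds or backreferences at all
Greedy matching: .* is greedy, so with repeated tags such as <p>a</p><p>b</p> the lookaround pattern matches a</p><p>b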
No nested tags: Doesn't properly handle nested HTML structures
No attributes: Doesn't account for attributes in opening tags
Single match: Only matches first occurrence (use global flag for all)
No whitespace handling: Doesn't trim whitespace automatically
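
Several of these limitations can be eased without leaving regex. The Python sketch below is one possible variant, not a general HTML parser. It assumes attribute values never contain a > character, uses a lazy quantifier with re.DOTALL so content may span lines, collects every match with finditer, and trims surrounding whitespace. The names TOLERANT and extract_all are illustrative.

```python
import re

# <tag ...attributes...>content</tag>
# Assumption: attribute values never contain ">".
TOLERANT = re.compile(r"<(\w+)(?:\s[^>]*)?>(.*?)</\1>", re.DOTALL)

def extract_all(html):
    """Return (tag, trimmed content) for every non-overlapping tag pair."""
    return [(m.group(1), m.group(2).strip()) for m in TOLERANT.finditer(html)]

html = '<span class="test"> Content here </span><p>one</p><p>two\nlines</p>'
print(extract_all(html))
# [('span', 'Content here'), ('p', 'one'), ('p', 'two\nlines')]
```

Same-name nesting still defeats this variant: for <div><div>a</div></div> the lazy match stops at the first </div> and returns <div>a as the content.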

When to Use

Extracting content from simple HTML tags
When you know the tag structure
Quick content extraction
Simple HTML parsing tasks
When tags are not nested

For production, consider:

Using an HTML parser (DOMParser, BeautifulSoup, etc.) instead of regex
Using the non-greedy quantifier .*? so a match stops at the first closing tag
Handling nested tags properly
Supporting attributes in opening tags
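
As an illustration of the parser route, here is a minimal sketch built on Python's standard-library html.parser module. It collects the text inside every outermost occurrence of one tag and copes with attributes and nesting. The class name TagTextExtractor and its attributes are hypothetical, and a real project might prefer a dedicated library such as BeautifulSoup.

```python
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collect the text inside every outermost occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0        # how many target tags are currently open
        self.results = []     # one string per outermost target element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Outermost element closed: store its accumulated text.
                self.results.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)

parser = TagTextExtractor("div")
parser.feed('<div class="a">Hello <span>nested</span> World</div><div>Second</div>')
parser.close()
print(parser.results)  # ['Hello nested World', 'Second']
```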
