struct LinkExtractor<S: SpanProvider> {
    span_provider: S,
    links: Vec<RawUri>,
    fragments: HashSet<String>,
    include_verbatim: bool,
    current_element: String,
    current_attributes: HashMap<String, Spanned<String>>,
    current_attribute_name: String,
    verbatim_stack: Vec<String>,
    in_style_tag: bool,
    style_content: String,
    style_content_offset: usize,
}
Extract links from HTML documents.
This is the main driver for the html5gum tokenizer.
It implements the Emitter trait, which is used by the tokenizer to
communicate with the caller.
The LinkExtractor keeps track of the element currently being processed,
the attribute currently being processed, and the plain characters
accumulated so far.
The links vector contains all links extracted from the HTML document and
the fragments set contains all fragments extracted from the HTML document.
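The event-driven shape described above can be sketched without the tokenizer. This is a minimal illustration only: the `Event` enum and `MiniExtractor` are hypothetical stand-ins, not html5gum's or this crate's actual types, and the sketch ignores spans, verbatim handling, and style content.

```rust
use std::collections::HashSet;

/// Hypothetical tokenizer events; the real html5gum events differ.
#[derive(Debug, Clone, PartialEq)]
pub enum Event<'a> {
    OpenTag(&'a str),
    Attribute(&'a str, &'a str),
    CloseTag,
}

/// Hypothetical, reduced stand-in for the extractor state.
#[derive(Debug, Default)]
pub struct MiniExtractor {
    pub links: Vec<String>,
    pub fragments: HashSet<String>,
    pub current_element: String,
}

impl MiniExtractor {
    /// React to one event and update the accumulated state.
    pub fn handle(&mut self, event: Event<'_>) {
        match event {
            Event::OpenTag(name) => self.current_element = name.to_string(),
            // `id` values become fragments.
            Event::Attribute("id", value) => {
                self.fragments.insert(value.to_string());
            }
            // Only two link-carrying attributes are recognized in this sketch.
            Event::Attribute("href", value) | Event::Attribute("src", value) => {
                self.links.push(value.to_string());
            }
            Event::Attribute(_, _) => {}
            Event::CloseTag => self.current_element.clear(),
        }
    }
}
```

Feeding it `OpenTag("a")`, `Attribute("href", ...)`, `Attribute("id", ...)`, and `CloseTag` in sequence leaves one entry in `links` and one in `fragments`.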
Fields§
span_provider: S
The SpanProvider which will be used to compute spans for URIs.
This is generic because, for example, the Markdown parser may already have started, so the span location has to be computed relative to the offset in the outer document.
links: Vec<RawUri>
Links extracted from the HTML document.
fragments: HashSet<String>
Fragments extracted from the HTML document.
include_verbatim: bool
Whether to include verbatim elements in the output.
current_element: String
Name of the element currently being processed. This is called a tag in html5gum.
current_attributes: HashMap<String, Spanned<String>>
Attributes of the element currently being processed. This is a map of key-value pairs, where the key is the attribute name and the value is the attribute value.
current_attribute_name: String
Name of the attribute currently being processed.
verbatim_stack: Vec<String>
Element names of the currently open verbatim blocks. Used to keep track of nested verbatim blocks.
in_style_tag: bool
Whether we're currently inside a <style> tag.
style_content: String
Accumulated CSS content from within a <style> tag.
style_content_offset: usize
Start offset of the style tag content (for span calculation).
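To illustrate why spans need an offset into the outer document, here is a minimal sketch. `OffsetProvider` and `Position` are hypothetical names chosen for this example; the actual SpanProvider trait and its methods are not shown on this page and may look quite different. Columns are counted in bytes here, which is an assumption.

```rust
/// Hypothetical 1-based line/column position in the outer document.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Position {
    pub line: usize,
    pub column: usize,
}

/// Hypothetical provider that maps byte offsets inside an embedded HTML
/// fragment to positions in the enclosing document.
pub struct OffsetProvider<'a> {
    outer: &'a str,
    base_offset: usize,
}

impl<'a> OffsetProvider<'a> {
    /// `base_offset` is the byte index in `outer` where the fragment starts.
    pub fn new(outer: &'a str, base_offset: usize) -> Self {
        Self { outer, base_offset }
    }

    /// Translate an offset within the fragment to an outer-document position.
    pub fn position(&self, inner_offset: usize) -> Position {
        let absolute = (self.base_offset + inner_offset).min(self.outer.len());
        let before = &self.outer.as_bytes()[..absolute];
        // Count preceding newlines to find the line number.
        let line = before.iter().filter(|&&b| b == b'\n').count() + 1;
        // The line starts just after the last newline, or at byte 0.
        let line_start = before
            .iter()
            .rposition(|&b| b == b'\n')
            .map_or(0, |i| i + 1);
        Position {
            line,
            column: absolute - line_start + 1,
        }
    }
}
```

With an HTML fragment that starts on the third line of a Markdown document, offset 0 inside the fragment maps to line 3, column 1 of the outer document rather than line 1.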
Implementations§
impl<S: SpanProvider> LinkExtractor<S>
fn new(span_provider: S, include_verbatim: bool) -> Self
Create a new LinkExtractor.
Set include_verbatim to true if you want to include verbatim
elements in the output.
fn extract_urls_from_elem_attr(&self) -> Vec<RawUri>
Extract all semantically known links from a given HTML attribute.
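"Semantically known" suggests a lookup from element and attribute names to link-carrying attributes. The sketch below shows one way such a lookup could work. The table and both function names are assumptions made for illustration; the crate's actual list of recognized pairs is not documented here. The srcset splitting is deliberately naive and would mishandle URLs that contain commas.

```rust
/// Hypothetical check: does this attribute on this element usually hold a URL?
/// The pairs listed are illustrative, not the crate's actual table.
pub fn is_link_attribute(element: &str, attribute: &str) -> bool {
    matches!(
        (element, attribute),
        ("a" | "area" | "link" | "base", "href")
            | ("img" | "script" | "iframe" | "audio" | "video" | "source", "src")
            | ("img" | "source", "srcset")
            | ("form", "action")
    )
}

/// Split a `srcset` value into candidate URLs, dropping the width or
/// density descriptor that follows each URL.
pub fn urls_from_srcset(value: &str) -> Vec<&str> {
    value
        .split(',')
        .filter_map(|candidate| candidate.split_whitespace().next())
        .collect()
}
```

Attributes such as `srcset` can yield several URLs from one value, which is consistent with the method returning a Vec rather than a single item.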
fn filter_verbatim_here(&self) -> bool
Check if we should filter out links in the current context due to being inside a verbatim element.
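A stack of open verbatim elements, combined with the include_verbatim flag, is enough to answer this question. The following is a sketch under assumptions: `VerbatimTracker`, its methods, and the list of verbatim element names are hypothetical and only mirror the verbatim_stack and include_verbatim fields described above.

```rust
/// Hypothetical list of elements whose content counts as verbatim.
fn is_verbatim_element(name: &str) -> bool {
    matches!(name, "pre" | "code" | "textarea" | "samp" | "xmp" | "plaintext" | "listing")
}

/// Hypothetical tracker for nested verbatim blocks.
#[derive(Debug, Default)]
pub struct VerbatimTracker {
    include_verbatim: bool,
    stack: Vec<String>,
}

impl VerbatimTracker {
    pub fn new(include_verbatim: bool) -> Self {
        Self { include_verbatim, stack: Vec::new() }
    }

    /// Push the element if it opens a verbatim block.
    pub fn open(&mut self, element: &str) {
        if is_verbatim_element(element) {
            self.stack.push(element.to_string());
        }
    }

    /// Pop only when the closing element matches the innermost verbatim block.
    pub fn close(&mut self, element: &str) {
        if self.stack.last().map(String::as_str) == Some(element) {
            self.stack.pop();
        }
    }

    /// Links are filtered when verbatim content is excluded and at least
    /// one verbatim block is still open.
    pub fn filter_here(&self) -> bool {
        !self.include_verbatim && !self.stack.is_empty()
    }
}
```

Using a stack rather than a single flag keeps filtering active after an inner `code` closes while the outer `pre` is still open.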
fn flush_links(&mut self)
Flush the current element and attribute values to the links vector.
This function is called whenever a new element is encountered or when the current element is closing. It extracts URLs from the current attribute value and adds them to the links vector.
Here are the rules for extracting links:
- If the current element has a rel=nofollow attribute, the current attribute value is ignored.
- If the current element has a rel=preconnect or rel=dns-prefetch attribute, the current attribute value is ignored.
- If the current attribute value is not a URL, it is treated as plain text and added to the links vector.
- If the current attribute name is id, the current attribute value is added to the fragments set.
The current attribute name and value are cleared after processing.
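The rel and id rules above can be sketched as follows. `FlushSketch` is a hypothetical, simplified stand-in: it stores plain strings instead of RawUri and Spanned values, only looks at `href` and `src`, compares rel tokens case-sensitively, and records the `id` before checking `rel`. Whether the real method orders those two checks the same way is an assumption.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, reduced version of the extractor state used by the flush step.
#[derive(Debug, Default)]
pub struct FlushSketch {
    pub links: Vec<String>,
    pub fragments: HashSet<String>,
    pub current_element: String,
    pub current_attributes: HashMap<String, String>,
}

impl FlushSketch {
    pub fn flush_links(&mut self) {
        // Taking the map clears the per-element state as a side effect.
        let attributes = std::mem::take(&mut self.current_attributes);
        self.current_element.clear();

        // Rule: an `id` value is recorded as a fragment.
        if let Some(id) = attributes.get("id") {
            self.fragments.insert(id.clone());
        }

        // Rules: nofollow, preconnect, and dns-prefetch suppress the link.
        let skip = attributes.get("rel").map_or(false, |rel| {
            rel.split_whitespace()
                .any(|token| matches!(token, "nofollow" | "preconnect" | "dns-prefetch"))
        });
        if skip {
            return;
        }

        // Remaining link-carrying attributes are pushed onto `links`.
        for name in ["href", "src"] {
            if let Some(value) = attributes.get(name) {
                self.links.push(value.clone());
            }
        }
    }
}
```

Because `rel` may hold several space-separated tokens, the sketch checks each token instead of comparing the whole attribute value.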
Trait Implementations§
impl<S: SpanProvider> Callback<(), usize> for &mut LinkExtractor<S>
fn handle_event(&mut self, event: CallbackEvent<'_>, span: Span<usize>) -> Option<()>
impl<S: Clone + SpanProvider> Clone for LinkExtractor<S>
fn clone(&self) -> LinkExtractor<S>
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
Auto Trait Implementations§
impl<S> Freeze for LinkExtractor<S> where S: Freeze
impl<S> RefUnwindSafe for LinkExtractor<S> where S: RefUnwindSafe
impl<S> Send for LinkExtractor<S> where S: Send
impl<S> Sync for LinkExtractor<S> where S: Sync
impl<S> Unpin for LinkExtractor<S> where S: Unpin
impl<S> UnsafeUnpin for LinkExtractor<S> where S: UnsafeUnpin
impl<S> UnwindSafe for LinkExtractor<S> where S: UnwindSafe
Blanket Implementations§
impl<T> BorrowMut<T> for T where T: ?Sized
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T where T: Clone
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.