Struct LinkExtractor 

struct LinkExtractor<S: SpanProvider> {
    span_provider: S,
    links: Vec<RawUri>,
    fragments: HashSet<String>,
    include_verbatim: bool,
    current_element: String,
    current_attributes: HashMap<String, Spanned<String>>,
    current_attribute_name: String,
    verbatim_stack: Vec<String>,
    in_style_tag: bool,
    style_content: String,
    style_content_offset: usize,
}

Extract links from HTML documents.

This is the main driver for the html5gum tokenizer. It implements html5gum's Callback trait (see Trait Implementations), which the tokenizer uses to communicate with the caller.

The LinkExtractor keeps track of the element currently being processed, the attribute currently being processed, and any plain text accumulated so far (such as the contents of a <style> tag).

The links vector contains all links extracted from the HTML document and the fragments set contains all fragments extracted from the HTML document.

Fields§

§span_provider: S

The SpanProvider which will be used to compute spans for URIs.

This is generic because the HTML may be embedded in another document. For example, when the Markdown parser has already consumed part of the input, span locations have to be computed relative to the offset of the HTML within the outer document.
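
As a rough illustration of why the provider is generic, the sketch below wraps one provider in another that shifts offsets by a fixed base. The trait `ToySpanProvider` and the types `WholeDocument` and `Shifted` are hypothetical names invented for this example; they are not the crate's actual SpanProvider API, whose signature is not shown on this page.

```rust
/// Hypothetical stand-in for a span provider: maps a byte offset to a
/// 1-based (line, column) position. Not the crate's actual trait.
trait ToySpanProvider {
    fn position(&self, offset: usize) -> (usize, usize);
}

/// Resolves offsets against a complete document.
struct WholeDocument<'a> {
    text: &'a str,
}

impl ToySpanProvider for WholeDocument<'_> {
    fn position(&self, offset: usize) -> (usize, usize) {
        let mut line = 1;
        let mut column = 1;
        // Walk every character that starts before `offset`.
        for (index, ch) in self.text.char_indices() {
            if index >= offset {
                break;
            }
            if ch == '\n' {
                line += 1;
                column = 1;
            } else {
                column += 1;
            }
        }
        (line, column)
    }
}

/// Wraps another provider for an HTML fragment that starts `base` bytes
/// into the outer document.
struct Shifted<P> {
    inner: P,
    base: usize,
}

impl<P: ToySpanProvider> ToySpanProvider for Shifted<P> {
    fn position(&self, offset: usize) -> (usize, usize) {
        // Offsets reported by the HTML tokenizer are relative to the
        // fragment, so translate them into outer-document offsets first.
        self.inner.position(self.base + offset)
    }
}
```

With this shape, an extractor written against the trait does not need to know whether it is looking at a standalone HTML file or at a fragment embedded in Markdown.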

§links: Vec<RawUri>

Links extracted from the HTML document.

§fragments: HashSet<String>

Fragments extracted from the HTML document.

§include_verbatim: bool

Whether to include verbatim elements in the output.

§current_element: String

Current element name being processed. This is called a tag in html5gum.

§current_attributes: HashMap<String, Spanned<String>>

Attributes of the element currently being processed. This is a map from attribute name to attribute value, where each value carries its span. Because it is a HashMap, the order of appearance is not preserved.

§current_attribute_name: String

Current attribute name being processed.

§verbatim_stack: Vec<String>

Stack of element names for the verbatim blocks that are currently open. Used to keep track of nested verbatim blocks.
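
A minimal sketch of how a stack can track nested verbatim blocks is shown below. The `VerbatimTracker` type and the `VERBATIM_ELEMENTS` list are assumptions made for illustration only; the actual set of verbatim elements and the exact push/pop logic are not documented on this page.

```rust
/// Hypothetical list of verbatim element names; the real list may differ.
const VERBATIM_ELEMENTS: [&str; 3] = ["pre", "code", "textarea"];

/// Toy tracker that mirrors the role of `verbatim_stack`.
#[derive(Default)]
struct VerbatimTracker {
    stack: Vec<String>,
}

impl VerbatimTracker {
    /// Called for every start tag: remember it if it opens a verbatim block.
    fn open(&mut self, name: &str) {
        if VERBATIM_ELEMENTS.contains(&name) {
            self.stack.push(name.to_string());
        }
    }

    /// Called for every end tag: pop only if it closes the innermost
    /// verbatim block.
    fn close(&mut self, name: &str) {
        if self.stack.last().map(String::as_str) == Some(name) {
            self.stack.pop();
        }
    }

    /// True while at least one verbatim block is still open.
    fn in_verbatim(&self) -> bool {
        !self.stack.is_empty()
    }
}
```

Using a stack rather than a single flag means that closing an inner `code` element does not end the verbatim context of an enclosing `pre`.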

§in_style_tag: bool

Whether we’re currently inside a <style> tag.

§style_content: String

Accumulated CSS content from within a <style> tag.

§style_content_offset: usize

Start offset of the style tag content (for span calculation).
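
To illustrate how a start offset can be combined with positions inside the accumulated CSS, here is a hedged sketch. The function `css_url_offsets` and its naive `url(...)` scan are hypothetical and written only for this example; the crate's real CSS handling may work quite differently.

```rust
/// Returns true for the quote characters that may wrap a CSS URL.
fn is_quote(c: char) -> bool {
    c == '"' || c == '\''
}

/// Hypothetical helper: finds `url(...)` references in `style_content` and
/// returns each URL together with its offset in the outer document.
fn css_url_offsets(style_content: &str, style_content_offset: usize) -> Vec<(String, usize)> {
    let mut found = Vec::new();
    let mut cursor = 0;
    while let Some(rel) = style_content[cursor..].find("url(") {
        let start = cursor + rel + "url(".len();
        let len = match style_content[start..].find(')') {
            Some(len) => len,
            None => break, // unterminated `url(`: stop scanning
        };
        let raw = &style_content[start..start + len];
        // Skip leading whitespace and an optional opening quote, counting
        // the skipped bytes so the reported offset points at the URL itself.
        let inner = raw.trim_start();
        let unquoted = inner.trim_start_matches(is_quote);
        let skipped = raw.len() - unquoted.len();
        let url = unquoted.trim_end().trim_end_matches(is_quote);
        found.push((url.to_string(), style_content_offset + start + skipped));
        cursor = start + len + 1;
    }
    found
}
```

The offset inside the `<style>` content is only meaningful relative to that content, so adding `style_content_offset` is what turns it into a position in the whole document.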

Implementations§

Source§

impl<S: SpanProvider> LinkExtractor<S>

fn new(span_provider: S, include_verbatim: bool) -> Self

Create a new LinkExtractor.

Set include_verbatim to true if you want to include verbatim elements in the output.

fn extract_urls_from_elem_attr(&self) -> Vec<RawUri>

Extract all semantically known links from a given HTML attribute.

fn filter_verbatim_here(&self) -> bool

Check if we should filter out links in the current context due to being inside a verbatim element.

Flush the current element and attribute values to the links vector.

This function is called whenever a new element is encountered or when the current element is closing. It extracts URLs from the current attribute value and adds them to the links vector.

Here are the rules for extracting links:

  • If the current element has a rel=nofollow attribute, the current attribute value is ignored.
  • If the current element has a rel=preconnect or rel=dns-prefetch attribute, the current attribute value is ignored.
  • If the current attribute value is not a URL, it is treated as plain text and added to the links vector.
  • If the current attribute name is id, the current attribute value is added to the fragments set.

The current attribute name and value are cleared after processing.
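
The rules above can be sketched roughly as follows. The function `flush_element`, the plain `String` links, and the choice of `href` and `src` as link-bearing attributes are all assumptions made for this example; the real method works on RawUri values and a larger set of attributes, and its ordering of the checks may differ.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified version of the flush step: applies the `rel`
/// and `id` rules to one element's attributes.
fn flush_element(
    attributes: &HashMap<String, String>,
    links: &mut Vec<String>,
    fragments: &mut HashSet<String>,
) {
    // An `id` attribute defines a fragment target. Recording it before the
    // `rel` check is an assumption of this sketch.
    if let Some(id) = attributes.get("id") {
        fragments.insert(id.clone());
    }

    // Skip elements whose `rel` contains nofollow, preconnect, or dns-prefetch.
    let skip = attributes.get("rel").map_or(false, |rel| {
        rel.split_whitespace()
            .any(|token| matches!(token, "nofollow" | "preconnect" | "dns-prefetch"))
    });
    if skip {
        return;
    }

    // Assumed link-bearing attributes for the purpose of this example.
    for name in ["href", "src"] {
        if let Some(value) = attributes.get(name) {
            links.push(value.clone());
        }
    }
}
```

In the real extractor the attribute state is cleared afterwards; here the caller owns the map, so clearing is left to the caller.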

Trait Implementations§

Source§

impl<S: SpanProvider> Callback<(), usize> for &mut LinkExtractor<S>

Source§

fn handle_event( &mut self, event: CallbackEvent<'_>, span: Span<usize>, ) -> Option<()>

Perform some action on a parsing event, and, optionally, return a value that can be yielded from the crate::Tokenizer iterator.
Source§

impl<S: Clone + SpanProvider> Clone for LinkExtractor<S>

Source§

fn clone(&self) -> LinkExtractor<S>

Returns a duplicate of the value.
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.
Source§

impl<S: Debug + SpanProvider> Debug for LinkExtractor<S>

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

Auto Trait Implementations§

§

impl<S> Freeze for LinkExtractor<S>
where S: Freeze,

§

impl<S> RefUnwindSafe for LinkExtractor<S>
where S: RefUnwindSafe,

§

impl<S> Send for LinkExtractor<S>
where S: Send,

§

impl<S> Sync for LinkExtractor<S>
where S: Sync,

§

impl<S> Unpin for LinkExtractor<S>
where S: Unpin,

§

impl<S> UnsafeUnpin for LinkExtractor<S>
where S: UnsafeUnpin,

§

impl<S> UnwindSafe for LinkExtractor<S>
where S: UnwindSafe,

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self.
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value.
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest.
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T> Instrument for T

Source§

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.
Source§

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.
Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoEither for T

Source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
Source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.
§

impl<T> Pointable for T

§

const ALIGN: usize

The alignment of pointer.
§

type Init = T

The type for initializers.
§

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes an object with the given initializer.
§

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.
§

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.
§

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.
§

impl<T> PolicyExt for T
where T: ?Sized,

§

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow only if self and other return Action::Follow.
§

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow if either self or other returns Action::Follow.
Source§

impl<T> Same for T

Source§

type Output = T

Should always be Self
Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

§

fn vzip(self) -> V

Source§

impl<T> WithSubscriber for T

Source§

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.
Source§

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.