Merged
69 changes: 67 additions & 2 deletions CHANGELOG.txt
@@ -1,5 +1,70 @@
1.0 (unreleased)
1.0 will include an api overhaul and remove all deprecations
1.0.0 (unreleased)

IMPORTANT

This release has many breaking changes.

Deprecated legacy code was removed.

Work has been done to make the API more consistent.

Several long-standing bugs and inconsistencies were fixed.


Backwards Incompatible Changes:

Remove Deprecated Functions:
``MetadataParser.get_metadata``
``MetadataParser.get_metadatas``
``MetadataParser.is_opengraph_minimum``
``MetadataParser.metadata``
``MetadataParser.metadata_encoding``
``MetadataParser.metadata_version``
``MetadataParser.soup``
``ParsedResult.get_metadata``

Remove Deprecated Functionality:
``MetadataParser.__init__::cached_urlparser``
no longer accepts `int` to control `cached_urlparser_maxitems`

Encoder changes
affected functions:
``decode_html``
``encode_ascii``
``ParsedResult.default_encoder``
``ParsedResult.get_metadatas::encoder``
``MetadataParser.__init__::default_encoder``
previously, encoders accepted a single argument, which was documented
to be a string. This caused issues with DublinCore (DC) elements, as
that storage uses a dict.
now they accept two arguments:
Arg 1 is the raw input value, either a string or a dict
Arg 2 is an optional string identifying the strategy/store
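
Under the new convention, a custom encoder might look like the minimal
sketch below. The function name and the lowercasing behavior are
illustrative only, not part of the library:

```python
from typing import Optional, Union

def lowercase_encoder(
    raw: Union[str, dict],
    store: Optional[str] = None,
) -> Union[str, dict]:
    # Illustrative encoder for the two-argument convention.
    # Dict payloads (e.g. from the "dc" store) keep name/content
    # pairs, so only the "content" value is transformed.
    if isinstance(raw, dict):
        encoded = dict(raw)
        if "content" in encoded:
            encoded["content"] = encoded["content"].lower()
        return encoded
    return raw.lower()
```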

API Changes
The package was split into namespaces.
``MetadataParser.__init__`` now validates submitted `strategy` args

``MetadataParser.strategy`` now defaults to: `["meta", "page", "og", "dc", "twitter"]`
previously this was: `["og", "dc", "meta", "page", "twitter"]`

``ParsedResult.get_metadatas`` will now return a dict or None.
A bug was discovered in which it would return only the first matched
element when multiple matches existed

An invalid strategy will now raise `InvalidStrategy`, a subclass of `ValueError`

`InvalidDocument` no longer has a .message attribute

Exceptions now invoke `super().__init__(args)`
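
A minimal sketch of how strategy validation can surface as a
`ValueError` subclass. The allowed-name set and the helper name are
illustrative, not the library's internals:

```python
class InvalidStrategy(ValueError):
    """Raised when an unknown strategy name is submitted."""

# illustrative set of known stores, mirroring the default strategy
_KNOWN_STORES = {"meta", "page", "og", "dc", "twitter"}

def validate_strategy(strategy: list) -> list:
    # Reject any name outside the known stores; because
    # InvalidStrategy subclasses ValueError, existing callers
    # that catch ValueError keep working.
    for name in strategy:
        if name not in _KNOWN_STORES:
            raise InvalidStrategy("invalid strategy: %r" % name)
    return strategy
```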

New Functionality

``ParsedResult.select_first_match(field, strategy)``
will return the first match for the given strategy, or for the default
strategy if none is supplied
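
The first-match semantics can be sketched in isolation. The payload
shape and the standalone helper below are illustrative stand-ins, not
the library's internals:

```python
from typing import Dict, List, Optional

# hypothetical parsed payload: store name -> field -> collected values
PARSED: Dict[str, Dict[str, List[str]]] = {
    "og": {"title": ["OG Title"]},
    "page": {"title": ["Page Title"]},
}

def select_first_match(
    parsed: Dict[str, Dict[str, List[str]]],
    field: str,
    strategy: Optional[List[str]] = None,
) -> Optional[str]:
    # Walk the stores in strategy order and return the first
    # value found; None when no store has the field.
    strategy = strategy or ["meta", "page", "og", "dc", "twitter"]
    for store in strategy:
        values = parsed.get(store, {}).get(field)
        if values:
            return values[0]
    return None
```

With the default order, "page" is consulted before "og", so the page
title wins unless a custom strategy is passed.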



0.13.1
96 changes: 30 additions & 66 deletions README.rst
@@ -7,10 +7,10 @@ Build Status: |build_status|

MetadataParser is a Python module for pulling metadata out of web documents.

It requires `BeautifulSoup` for parsing. `Requests` is required for installation
at this time, but not for operation. Additional functionality is automatically
enabled if the `tldextract` project is installed, but can be disabled by
setting an environment variable.
`BeautifulSoup` is required for parsing.
`Requests` is required for fetching remote documents.
`tldextract` is utilized to parse domains, but can be disabled by setting an
environment variable.

This project has been used in production for many years, and has successfully
parsed billions of documents.
@@ -29,20 +29,19 @@ For example:
* if the current release is: `0.10.6`
* the advised pin is: `metadata_parser<0.11`

PATCH releases will usually be bug fixes and new features that support backwards compatibility with Public Methods. Private Methods are not guaranteed to be
PATCH releases will usually be bug fixes and new features that support backwards
compatibility with Public Methods. Private Methods are not guaranteed to be
backwards compatible.

MINOR releases are triggered when there is a breaking change to Public Methods.
Once a new MINOR release is triggered, first-party support for the previous MINOR
release is EOL (end of life). PRs for previous releases are welcome, but giving
them proper attention is not guaranteed.

The current MAJOR release is `0`.
A `1` MAJOR release is planned, and will have an entirely different structure and API.

Future deprecations will raise warnings.

By populating the following environment variable, future deprecations will raise exceptions:

export METADATA_PARSER_FUTURE=1
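
One common way to implement this escalation is to check the
environment variable at warning time. A sketch under that assumption
(the helper name is illustrative, not the library's API):

```python
import os
import warnings

def warn_future(message: str) -> None:
    # Escalate deprecation warnings to hard errors when the
    # METADATA_PARSER_FUTURE environment variable is set.
    if os.environ.get("METADATA_PARSER_FUTURE"):
        raise DeprecationWarning(message)
    warnings.warn(message, DeprecationWarning, stacklevel=2)
```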

Installation
@@ -74,7 +73,7 @@ Features
Logging
=======

This file has extensive logging to help developers pinpoint problems.
This file utilizes extensive logging to help developers pinpoint problems.

* ``log.debug``
This log level is mostly used to handle library maintenance and
@@ -109,7 +108,8 @@ Optional Integrations

* ``tldextract``
This package will attempt to use the package ``tldextract`` for advanced domain
and hostname analysis. If ``tldextract`` is not found, a fallback is used.
and hostname analysis. If ``tldextract`` is not wanted, it can be disabled
with an environment variable.


Environment Variables
@@ -132,7 +132,7 @@ Notes

1. This package requires BeautifulSoup 4.
2. For speed, it will instantiate a BeautifulSoup parser with lxml, and
fallback to 'none' (the internal pure Python) if it can't load lxml.
fallback to 'None' (the internal pure Python) if it cannot load lxml.
3. URL Validation is not RFC compliant, but tries to be "Real World" compliant.

It is HIGHLY recommended that you install lxml for usage.
@@ -145,7 +145,7 @@ Using at least the most recent 3.x versions is strongly recommended

The default 'strategy' is to look in this order::

og,dc,meta,page
meta,page,og,dc

Which stands for the following::

@@ -239,27 +239,27 @@ is extracted from the metadata payload::

>>> import metadata_parser
>>> page = metadata_parser.MetadataParser(url="http://www.example.com")
>>> print page.get_metadata_link('image')
>>> print(page.get_metadata_link('image'))

This method accepts a kwarg ``allow_encoded_uri`` (default False) which will
return the image without further processing::

>>> print page.get_metadata_link('image', allow_encoded_uri=True)
>>> print(page.get_metadata_link('image', allow_encoded_uri=True))

Similarly, if a url is local::

<meta property="og:image" content="/image.jpg" />

The ``get_metadata_link`` method will automatically upgrade it onto the domain::

>>> print page.get_metadata_link('image')
>>> print(page.get_metadata_link('image'))
http://example.com/image.jpg
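
The upgrade amounts to resolving the relative path against the URL the
document was fetched from; a minimal sketch using only the standard
library (the helper name is illustrative, not the package's API):

```python
from urllib.parse import urljoin

def upgrade_link(document_url: str, content: str) -> str:
    # Relative links are resolved against the source URL;
    # absolute links pass through unchanged.
    return urljoin(document_url, content)
```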

Poorly Constructed Canonical URLs
---------------------------------

Many website publishers implement canonical URLs incorrectly. This package
tries to fix that.
Many website publishers implement canonical URLs incorrectly.
This package tries to fix that.

By default ``MetadataParser`` is constructed with ``require_public_netloc=True``
and ``allow_localhosts=True``.
@@ -298,17 +298,17 @@ improper canonical url, and remount the local part "/alt-path/to/foo" onto the
domain that served the file. The vast majority of times this 'behavior'
has been encountered, this is the intended canonical::

print page.get_discrete_url()
print(page.get_discrete_url())
>>> http://example.com/alt-path/to/foo

In contrast, versions 0.8.3 and earlier will not catch this situation::

print page.get_discrete_url()
print(page.get_discrete_url())
>>> http://localhost:8000/alt-path/to/foo

In order to preserve the earlier behavior, just submit ``require_public_global=False``::

print page.get_discrete_url(require_public_global=False)
print(page.get_discrete_url(require_public_global=False))
>>> http://localhost:8000/alt-path/to/foo


@@ -340,43 +340,7 @@ content, not just templates/Site-Operators.
WARNING
=============

1.0 will be a complete API overhaul. Pin your releases to avoid sadness.


Version 0.9.19 Breaking Changes
===============================

Issue #12 exposed some flaws in the existing package

1. ``MetadataParser.get_metadatas`` replaces ``MetadataParser.get_metadata``
----------------------------------------------------------------------------

Until version 0.9.19, the recommended way to get metadata was to use
``get_metadata``, which returns either a string or None.

Starting with version 0.9.19, the recommended way to get metadata is to use
``get_metadatas`` which will always return a list (or None).

This change was made because the library incorrectly stored a single metadata
key value when there were duplicates.

2. The ``ParsedResult`` payload stores mixed content and tracks its version
----------------------------------------------------------------------------

Many users (including the maintainer) archive the parsed metadata. After
testing a variety of payloads with an all-list format and a mixed format
(string or list), a mixed format had a much smaller payload size with a
negligible performance hit. A new ``_v`` attribute tracks the payload version.
In the future, payloads without a ``_v`` attribute will be interpreted as the
pre-versioning format.

3. ``DublinCore`` payloads might be a dict
------------------------------------------

Tests were added to handle dublincore data. An extra attribute may be needed to
properly represent the payload, so always returning a dict with at least a
name+content (and possibly ``lang`` or ``scheme``) is the best approach.

Please pin your releases.


Usage
@@ -389,19 +353,19 @@ Until version ``0.9.19``, the recommended way to get metadata was to use

>>> import metadata_parser
>>> page = metadata_parser.MetadataParser(url="http://www.example.com")
>>> print page.metadata
>>> print page.get_metadatas('title')
>>> print page.get_metadatas('title', strategy=['og',])
>>> print page.get_metadatas('title', strategy=['page', 'og', 'dc',])
>>> print(page.metadata)
>>> print(page.get_metadatas('title'))
>>> print(page.get_metadatas('title', strategy=['og',]))
>>> print(page.get_metadatas('title', strategy=['page', 'og', 'dc',]))

**From HTML**::

>>> HTML = """<here>"""
>>> page = metadata_parser.MetadataParser(html=HTML)
>>> print page.metadata
>>> print page.get_metadatas('title')
>>> print page.get_metadatas('title', strategy=['og',])
>>> print page.get_metadatas('title', strategy=['page', 'og', 'dc',])
>>> print(page.metadata)
>>> print(page.get_metadatas('title'))
>>> print(page.get_metadatas('title', strategy=['og',]))
>>> print(page.get_metadatas('title', strategy=['page', 'og', 'dc',]))


Malformed Data
@@ -428,4 +392,4 @@ when building on Python3, a ``static`` toplevel directory may be needed

This library was originally based on Erik River's
`opengraph module <https://github.com/erikriver/opengraph>`_. Something more
aggressive than Erik's module was needed, so this project was started.
5 changes: 5 additions & 0 deletions TODO.txt
@@ -0,0 +1,5 @@
1.0.0
tests needed for:
select_first_strategy
try to break it
select different strategies, different data on each
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.black]
line-length = 88
target-version = ['py36']
target-version = ['py37']
exclude = '''
(
/(
2 changes: 0 additions & 2 deletions pytest.ini
@@ -1,5 +1,3 @@
[pytest]

filterwarnings =
ignore:MetadataParser.
ignore:`ParsedResult.get_metadata` returns a string
13 changes: 8 additions & 5 deletions setup.cfg
@@ -1,13 +1,16 @@
[flake8]
application_import_names = metadata_parser
import_order_style = appnexus
exclude = .eggs/*, .pytest_cache/*, .tox/*, build/*, dist/*, workspace-demos/*
max_line_length = 88

# ignore = E402,E501,W503
# E501: line too long
# F401: imported but unused
# I202: Additional newline in a group of imports
per-file-ignores =
setup.py: E501
src/metadata_parser/__init__.py: E501,I202
setup.py:
src/metadata_parser/__init__.py: E501
src/metadata_parser/regex.py: E501
tests/*: E501
tests/_compat.py: F401
exclude = .eggs/*, .pytest_cache/*, .tox/*, build/*, dist/*, workspace-demos/*
application_import_names = metadata_parser
import_order_style = appnexus
2 changes: 0 additions & 2 deletions setup.py
@@ -32,8 +32,6 @@
"requests-toolbelt>=0.8.0",
"typing_extensions",
]
if sys.version_info.major == 2:
requires.append("backports.html")

if sys.version_info >= (3, 13):
requires.append("legacy-cgi")