Merged
69 changes: 67 additions & 2 deletions CHANGELOG.txt
@@ -1,5 +1,70 @@
1.0 (unreleased)
1.0 will include an api overhaul and remove all deprecations
1.0.0 (unreleased)

IMPORTANT

This release has many breaking changes.

Deprecated legacy code was removed.

Work has been done to make the API more consistent.

Several long-standing bugs and inconsistencies were fixed.


Backwards Incompatible Changes:

Remove Deprecated Functions:
``MetadataParser.get_metadata``
``MetadataParser.get_metadatas``
``MetadataParser.is_opengraph_minimum``
``MetadataParser.metadata``
``MetadataParser.metadata_encoding``
``MetadataParser.metadata_version``
``MetadataParser.soup``
``ParsedResult.get_metadata``

Remove Deprecated Functionality:
``MetadataParser.__init__::cached_urlparser``
no longer accepts `int` to control `cached_urlparser_maxitems`

Encoder changes
affected functions:
``decode_html``
``encode_ascii``
``ParsedResult.default_encoder``
``ParsedResult.get_metadatas::encoder``
``MetadataParser.__init__::default_encoder``
previously, encoders accepted a single argument, which was documented
to be a string. This caused issues with DublinCore (DC) elements, as
that storage uses a dict.
now they accept two arguments:
Arg 1 is the raw input value, either a string or a dict
Arg 2 is an optional string identifying the strategy/store
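
Under the new convention, a custom encoder might look like the minimal
sketch below. The function name and the lowercasing behavior are
illustrative only, not part of the library:

```python
from typing import Optional, Union

def lowercase_encoder(
    raw: Union[str, dict],
    store: Optional[str] = None,
) -> Union[str, dict]:
    # Illustrative encoder for the two-argument convention.
    # Dict payloads (e.g. from the "dc" store) keep name/content
    # pairs, so only the "content" value is transformed.
    if isinstance(raw, dict):
        encoded = dict(raw)
        if "content" in encoded:
            encoded["content"] = encoded["content"].lower()
        return encoded
    return raw.lower()
```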

API Changes
The package was split into namespaces.
``MetadataParser.__init__`` now validates submitted `strategy` args

``MetadataParser.strategy`` now defaults to: `["meta", "page", "og", "dc", "twitter"]`
previously this was: `["og", "dc", "meta", "page", "twitter"]`

``ParsedResult.get_metadatas`` will now return a dict or None.
A bug was discovered in which it would return only the first matched
element when multiple matches existed

An invalid strategy will now raise `InvalidStrategy`, a subclass of `ValueError`

`InvalidDocument` no longer has a .message attribute

Exceptions now invoke `super().__init__(args)`
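
A minimal sketch of how strategy validation can surface as a
`ValueError` subclass. The allowed-name set and the helper name are
illustrative, not the library's internals:

```python
class InvalidStrategy(ValueError):
    """Raised when an unknown strategy name is submitted."""

# illustrative set of known stores, mirroring the default strategy
_KNOWN_STORES = {"meta", "page", "og", "dc", "twitter"}

def validate_strategy(strategy: list) -> list:
    # Reject any name outside the known stores; because
    # InvalidStrategy subclasses ValueError, existing callers
    # that catch ValueError keep working.
    for name in strategy:
        if name not in _KNOWN_STORES:
            raise InvalidStrategy("invalid strategy: %r" % name)
    return strategy
```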

New Functionality

``ParsedResult.select_first_match(field, strategy)``
will return the first match for the given strategy, or for the default
strategy if none is supplied
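
The first-match semantics can be sketched in isolation. The payload
shape and the standalone helper below are illustrative stand-ins, not
the library's internals:

```python
from typing import Dict, List, Optional

# hypothetical parsed payload: store name -> field -> collected values
PARSED: Dict[str, Dict[str, List[str]]] = {
    "og": {"title": ["OG Title"]},
    "page": {"title": ["Page Title"]},
}

def select_first_match(
    parsed: Dict[str, Dict[str, List[str]]],
    field: str,
    strategy: Optional[List[str]] = None,
) -> Optional[str]:
    # Walk the stores in strategy order and return the first
    # value found; None when no store has the field.
    strategy = strategy or ["meta", "page", "og", "dc", "twitter"]
    for store in strategy:
        values = parsed.get(store, {}).get(field)
        if values:
            return values[0]
    return None
```

With the default order, "page" is consulted before "og", so the page
title wins unless a custom strategy is passed.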



0.13.1
96 changes: 30 additions & 66 deletions README.rst
@@ -7,10 +7,10 @@ Build Status: |build_status|

MetadataParser is a Python module for pulling metadata out of web documents.

It requires `BeautifulSoup` for parsing. `Requests` is required for installation
at this time, but not for operation. Additional functionality is automatically
enabled if the `tldextract` project is installed, but can be disabled by
setting an environment variable.
`BeautifulSoup` is required for parsing.
`Requests` is required for fetching remote documents.
`tldextract` is utilized to parse domains, but can be disabled by setting an
environment variable.

This project has been used in production for many years, and has successfully
parsed billions of documents.
@@ -29,20 +29,19 @@ For example:
* if the current release is: `0.10.6`
* the advised pin is: `metadata_parser<0.11`

PATCH releases will usually be bug fixes and new features that support backwards compatibility with Public Methods. Private Methods are not guaranteed to be
PATCH releases will usually be bug fixes and new features that support backwards
compatibility with Public Methods. Private Methods are not guaranteed to be
backwards compatible.

MINOR releases are triggered when there is a breaking change to Public Methods.
Once a new MINOR release is triggered, first-party support for the previous MINOR
release is EOL (end of life). PRs for previous releases are welcome, but giving
them proper attention is not guaranteed.

The current MAJOR release is `0`.
A `1` MAJOR release is planned, and will have an entirely different structure and API.

Future deprecations will raise warnings.

By populating the following environment variable, future deprecations will raise exceptions:

export METADATA_PARSER_FUTURE=1
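
One common way to implement this escalation is to check the
environment variable at warning time. A sketch under that assumption
(the helper name is illustrative, not the library's API):

```python
import os
import warnings

def warn_future(message: str) -> None:
    # Escalate deprecation warnings to hard errors when the
    # METADATA_PARSER_FUTURE environment variable is set.
    if os.environ.get("METADATA_PARSER_FUTURE"):
        raise DeprecationWarning(message)
    warnings.warn(message, DeprecationWarning, stacklevel=2)
```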

Installation
@@ -74,7 +73,7 @@ Features
Logging
=======

This file has extensive logging to help developers pinpoint problems.
This file utilizes extensive logging to help developers pinpoint problems.

* ``log.debug``
This log level is mostly used to handle library maintenance and
@@ -109,7 +108,8 @@ Optional Integrations

* ``tldextract``
This package will attempt to use the package ``tldextract`` for advanced domain
and hostname analysis. If ``tldextract`` is not found, a fallback is used.
and hostname analysis. If ``tldextract`` is not wanted, it can be disabled
with an environment variable.


Environment Variables
@@ -132,7 +132,7 @@ Notes

1. This package requires BeautifulSoup 4.
2. For speed, it will instantiate a BeautifulSoup parser with lxml, and
fallback to 'none' (the internal pure Python) if it can't load lxml.
fallback to 'None' (the internal pure Python) if it cannot load lxml.
3. URL Validation is not RFC compliant, but tries to be "Real World" compliant.

It is HIGHLY recommended that you install lxml for usage.
@@ -145,7 +145,7 @@ Using at least the most recent 3.x versions is strongly recommended

The default 'strategy' is to look in this order::

og,dc,meta,page
meta,page,og,dc

Which stands for the following::

@@ -239,27 +239,27 @@ is extracted from the metadata payload::

>>> import metadata_parser
>>> page = metadata_parser.MetadataParser(url="http://www.example.com")
>>> print page.get_metadata_link('image')
>>> print(page.get_metadata_link('image'))

This method accepts a kwarg ``allow_encoded_uri`` (default False) which will
return the image without further processing::

>>> print page.get_metadata_link('image', allow_encoded_uri=True)
>>> print(page.get_metadata_link('image', allow_encoded_uri=True))

Similarly, if a url is local::

<meta property="og:image" content="/image.jpg" />

The ``get_metadata_link`` method will automatically upgrade it onto the domain::

>>> print page.get_metadata_link('image')
>>> print(page.get_metadata_link('image'))
http://example.com/image.jpg
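
The upgrade amounts to resolving the relative path against the URL the
document was fetched from; a minimal sketch using only the standard
library (the helper name is illustrative, not the package's API):

```python
from urllib.parse import urljoin

def upgrade_link(document_url: str, content: str) -> str:
    # Relative links are resolved against the source URL;
    # absolute links pass through unchanged.
    return urljoin(document_url, content)
```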

Poorly Constructed Canonical URLs
---------------------------------

Many website publishers implement canonical URLs incorrectly. This package
tries to fix that.
Many website publishers implement canonical URLs incorrectly.
This package tries to fix that.

By default ``MetadataParser`` is constructed with ``require_public_netloc=True``
and ``allow_localhosts=True``.
@@ -298,17 +298,17 @@ improper canonical url, and remount the local part "/alt-path/to/foo" onto the
domain that served the file. The vast majority of times this 'behavior'
has been encountered, this is the intended canonical::

print page.get_discrete_url()
print(page.get_discrete_url())
>>> http://example.com/alt-path/to/foo

In contrast, versions 0.8.3 and earlier will not catch this situation::

print page.get_discrete_url()
print(page.get_discrete_url())
>>> http://localhost:8000/alt-path/to/foo

In order to preserve the earlier behavior, just submit ``require_public_global=False``::

print page.get_discrete_url(require_public_global=False)
print(page.get_discrete_url(require_public_global=False))
>>> http://localhost:8000/alt-path/to/foo


@@ -340,43 +340,7 @@ content, not just templates/Site-Operators.
WARNING
=============

1.0 will be a complete API overhaul. Pin your releases to avoid sadness.


Version 0.9.19 Breaking Changes
===============================

Issue #12 exposed some flaws in the existing package

1. ``MetadataParser.get_metadatas`` replaces ``MetadataParser.get_metadata``
----------------------------------------------------------------------------

Until version 0.9.19, the recommended way to get metadata was to use
``get_metadata``, which returns either a string or None.

Starting with version 0.9.19, the recommended way to get metadata is to use
``get_metadatas`` which will always return a list (or None).

This change was made because the library incorrectly stored a single metadata
key value when there were duplicates.

2. The ``ParsedResult`` payload stores mixed content and tracks its version
----------------------------------------------------------------------------

Many users (including the maintainer) archive the parsed metadata. After
testing a variety of payloads with an all-list format and a mixed format
(string or list), a mixed format had a much smaller payload size with a
negligible performance hit. A new ``_v`` attribute tracks the payload version.
In the future, payloads without a ``_v`` attribute will be interpreted as the
pre-versioning format.

3. ``DublinCore`` payloads might be a dict
------------------------------------------

Tests were added to handle dublincore data. An extra attribute may be needed to
properly represent the payload, so always returning a dict with at least a
name+content (and possibly ``lang`` or ``scheme``) is the best approach.

Please pin your releases.


Usage
@@ -389,19 +353,19 @@ Until version ``0.9.19``, the recommended way to get metadata was to use

>>> import metadata_parser
>>> page = metadata_parser.MetadataParser(url="http://www.example.com")
>>> print page.metadata
>>> print page.get_metadatas('title')
>>> print page.get_metadatas('title', strategy=['og',])
>>> print page.get_metadatas('title', strategy=['page', 'og', 'dc',])
>>> print(page.metadata)
>>> print(page.get_metadatas('title'))
>>> print(page.get_metadatas('title', strategy=['og',]))
>>> print(page.get_metadatas('title', strategy=['page', 'og', 'dc',]))

**From HTML**::

>>> HTML = """<here>"""
>>> page = metadata_parser.MetadataParser(html=HTML)
>>> print page.metadata
>>> print page.get_metadatas('title')
>>> print page.get_metadatas('title', strategy=['og',])
>>> print page.get_metadatas('title', strategy=['page', 'og', 'dc',])
>>> print(page.metadata)
>>> print(page.get_metadatas('title'))
>>> print(page.get_metadatas('title', strategy=['og',]))
>>> print(page.get_metadatas('title', strategy=['page', 'og', 'dc',]))


Malformed Data
@@ -428,4 +392,4 @@ when building on Python3, a ``static`` toplevel directory may be needed

This library was originally based on Erik River's
`opengraph module <https://github.com/erikriver/opengraph>`_. Something more
aggressive than Erik's module was needed, so this project was started.
5 changes: 5 additions & 0 deletions TODO.txt
@@ -0,0 +1,5 @@
1.0.0
tests needed for:
select_first_strategy
try to break it
select different strategies, different data on each
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.black]
line-length = 88
target-version = ['py36']
target-version = ['py37']
exclude = '''
(
/(
2 changes: 0 additions & 2 deletions pytest.ini
@@ -1,5 +1,3 @@
[pytest]

filterwarnings =
ignore:MetadataParser.
ignore:`ParsedResult.get_metadata` returns a string
13 changes: 8 additions & 5 deletions setup.cfg
@@ -1,13 +1,16 @@
[flake8]
application_import_names = metadata_parser
import_order_style = appnexus
exclude = .eggs/*, .pytest_cache/*, .tox/*, build/*, dist/*, workspace-demos/*
max_line_length = 88

# ignore = E402,E501,W503
# E501: line too long
# F401: imported but unused
# I202: Additional newline in a group of imports
per-file-ignores =
setup.py: E501
src/metadata_parser/__init__.py: E501,I202
setup.py:
src/metadata_parser/__init__.py: E501
src/metadata_parser/regex.py: E501
tests/*: E501
tests/_compat.py: F401
exclude = .eggs/*, .pytest_cache/*, .tox/*, build/*, dist/*, workspace-demos/*
application_import_names = metadata_parser
import_order_style = appnexus
2 changes: 0 additions & 2 deletions setup.py
@@ -32,8 +32,6 @@
"requests-toolbelt>=0.8.0",
"typing_extensions",
]
if sys.version_info.major == 2:
requires.append("backports.html")

if sys.version_info >= (3, 13):
requires.append("legacy-cgi")