Improves the robust error handling in the article ingestion pipeline #1401
…er and request_pid
- Adds the XMLException and UnableToRegisterPIDError exceptions
- Adds the cached_property xml_with_pre with fallback (pid_provider_xml -> file -> url)
- Simplifies sps_pkg_name to use xml_with_pre
- Refactors request_xml: removes the force_update flag, uses uri instead of path, raises XMLException instead of setting the status internally
- Adds the auto_solve_pid_conflict and force_update parameters to create/create_or_update
- Replaces complete_data with add_pid_provider, with error handling separated by type (XMLException, RequestXMLException, Exception)
- Replaces get_or_create_pid_v3 with request_pid, which raises UnableToRegisterPIDError
- Adds validation of pid_provider_xml after obtaining v3
- Fixes is_completed to handle an exception when accessing xml_with_pre
…onflict
- task_select_articles_to_complete_data: fixes the pp_xml check and the parameter in the task call
- task_load_article_from_xml_url: removes the separate complete_data call, integrating it into create_or_update via add_pid_provider
- task_select_articles_to_load_from_article_source: replaces complete_data with add_pid_provider
- Propagates auto_solve_pid_conflict to task_load_article_from_pp_xml
…ion and fixes create_or_update
- Adds the auto_solve_pid_conflict parameter to create_or_update
- Adds logging to add_pid_provider to trace its steps (request_xml, request_pid, final status, and errors)
- Adds autocomplete_label and autocomplete_custom_queryset_filter to ArticleAffiliation, searching by raw_text, raw_institution_name, raw_country_name, raw_state_name, and raw_city_name
…e_pid_conflict (redundant)
- Adds logging to task_load_article_from_pp_xml to trace article loading
- Changes task_load_article_from_pp_xml from .delay() to a synchronous call in task_load_article_from_xml_url
- Removes the auto_solve_pid_conflict parameter from task_load_article_from_pp_xml and task_select_articles_to_load_from_article_source
- Adds exception logging to task_check_article_availability
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Allows calls with pid_v3, pid_v2, or partial_pid_v2 independently, without duplicating pid_v3 both as a positional argument and in params.
…for_opac_and_am_xml Removes the try/except block with DoesNotExist and unifies the lookup by pid_v3/pid_v2.
- Removes the complete_data method (unused)
- Replaces direct queries with get_by_pid_v3 in is_pp_xml_valid and register_pid
- Fixes the typo logoging -> logging in ArticleSource.is_completed
- Replaces try/except DoesNotExist with a call to get_by_pid_v3
- Fixes the parameter name pp_xml -> pp_xml_id in the delay call
- Resolves v3 to pp_xml before the try block
- Removes the redundant elif v3 branch
- Fixes the indentation of break in the file_path loop
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* fix: add non-root user with configurable UID in local Dockerfile * fix: get_url_logo returns file url directly, skipping rendition generation
…cieloorg#1369) * Initial plan * Fix AttributeError: 'Issue' object has no attribute 'code_sections' Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> * Add unit tests for _format_code_sections using table_of_contents Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> * Update issue/test_format_code_sections.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Use select_related to avoid N+1 queries in _format_code_sections Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…_one_collection (scieloorg#1371) * Initial plan * Add journal_issn_list parameter to load_journal_from_article_meta_for_one_collection Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> * Use journal_issn_list directly to fetch metadata instead of paginating full collection Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> * Update journal/sources/article_meta.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update journal/sources/article_meta.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Fix review feedback: pid__in filter, offset increment, blank line, collection_acron validation Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
)
* Initial plan
* Add crossmark update policy management: new fields, UpdatePolicy model, choices, and migration
Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
* Creates the feature for managing the data of the update policy (crossmark) page (scieloorg#1381)
* chore(journal): removes migration 0057 generated for the crossmark update policy
Removes migration 0057, which added the crossmark_policy_doi and crossmark_doi_is_active fields directly to the Journal model and created the UpdatePolicy model. These fields and the model will be replaced by the new approach based on CrossmarkPolicy.
* feat(journal): adds a migration to create the CrossmarkPolicy model
Creates the CrossmarkPolicy model with the fields doi, is_active, url, language, journal (ParentalKey with CASCADE), creator, and updated_by. Replaces the previous approach, which stored crossmark_policy_doi and crossmark_doi_is_active directly on the Journal model.
* refactor(journal): replaces UpdatePolicy with CrossmarkPolicy and converts the field to a property
- Renames the UpdatePolicy model to CrossmarkPolicy, with the fields doi, is_active, and url
- Removes the crossmark_policy_doi and crossmark_doi_is_active fields from the Journal model
- Adds the crossmark_doi_is_active property to Journal, which returns True if there is any active CrossmarkPolicy (is_active=True)
- Changes related_name from update_policy to crossmark_policy
- Changes on_delete from SET_NULL to CASCADE and makes journal and url required
* refactor(journal): updates the proxies to use crossmark_policy in place of update_policy
- Replaces the update_policy InlinePanel with crossmark_policy in JournalProxyEditor and JournalProxyPanelPolicy
- Removes the crossmark_policy_doi and crossmark_doi_is_active FieldPanels, now managed via the inline CrossmarkPolicy
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
…rg#1383) * Initial plan * Replace Journal.doi_prefix CharField with FK to CrossRefConfiguration Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> * Rename migration 0057 to 0058 and update dependency to 0057_crossmarkpolicy Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
* Initial plan * Add translation tags to base.html and include templates Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> * Add translation tags to journal_page.html Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> * Add missing translation tag to about.html Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> * Fix footer translation to use blocktrans for better context Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
…1391) * Initial plan * Add official__country filter to Journal admin list_filter Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
* Initial plan * Add admin area for Journal.CrossmarkPolicy with user-based queryset filtering Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> * Optimize CrossmarkPolicyAdmin queryset by removing unnecessary select_related fields Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> * Restrict journal field queryset in CrossmarkPolicy Create/Edit views to prevent privilege escalation Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> * Decouple CrossmarkPolicyJournalFilterMixin from JournalFormValidMixin, move journal panel to ViewSet Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
…nalSerializer (scieloorg#1386) * Initial plan * Add CrossmarkPolicy API endpoint at /api/v1/crossmarkpolicy Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> * Add docstring with example responses to CrossmarkPolicyViewSet Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> * Add crossmark_policy to JournalSerializer Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Use Q objects for ISSN filter so all params combine correctly Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> * Prefetch crossmark_policy in JournalViewSet; fix scielojournal_set prefetch name Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> * Fix N+1 queries: Python-level filtering in get_journal_acronym; proper Prefetch with select_related in viewsets Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> * Update journal/api/v1/views.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Fix get_journal_acronym fallback: return first unfiltered SciELOJournal when filters yield no match Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> * Simplify get_journal_acronym fallback expression for clarity Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…port of collection.domain (scieloorg#1392)
* Fixes the assignment of collection.domain by completing it with the protocol
* Fixes the import of AutocompletePanel
scieloorg#1387)
* Removes .md
* Adds context so that the sponsors are available in all templates
* fix __str__
* Adds a model for adding sponsors to the footer template.
* Changes the template to display the images registered through the interface
* add migration
* fix validate_lattes
* Adds get_homepage_with_language_locale
…with protocol Adds the base_url property, which checks whether the domain field already has a protocol (http:// or https://) and, if not, prepends https://. This avoids malformed URLs such as 'www.scielo.br/scielo.php?...' that were previously generated by concatenating domain directly into URL strings.
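A minimal sketch of the protocol check described in this commit message. The standalone function name `build_base_url` and the choice to return an empty string for an empty domain are assumptions for illustration; the actual property on the collection model may differ.

```python
def build_base_url(domain: str) -> str:
    """Return `domain` with a protocol, defaulting to https://.

    Hypothetical helper: mirrors the behaviour described for base_url,
    not the project's actual implementation.
    """
    domain = (domain or "").strip()
    if not domain:
        # Assumption: an empty domain yields an empty base URL.
        return ""
    if domain.startswith(("http://", "https://")):
        # The protocol is already present, so keep the value as-is.
        return domain
    return f"https://{domain}"


# Concatenation now produces a well-formed URL.
fulltext_url = build_base_url("www.scielo.br") + "/scielo.php"
```

Keeping the check in one place means every URL built from the collection goes through the same normalization.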
Removes the classname='collapsed' attribute from all InlinePanels in the Journal and SciELOJournal panels (other_titles, mission, history, focus, thematic_area, title_in_database, owner_history, publisher_history, sponsor_history, related_journal_urls, open_science_form_files, open_access_text, open_data, preprint, peer_review, notes, journal_history). The panels are now displayed expanded by default in the Wagtail interface.
…agtail Adds the updated field to the listings of AMJournalAdmin, IndexedAtAdmin, AdditionalIndexedAtAdmin, WebOfKnowledgeAdmin, SubjectAdmin, WosAreaAdmin, and StandardAdmin. This makes it easier to audit and identify recently modified records directly in the admin listing.
…d_pid_provider In is_pp_xml_valid, wraps the PidProviderXML lookup in try/except to catch DoesNotExist explicitly, assigning None instead of propagating the exception. In add_pid_provider, replaces unconditional execution of the steps with state checks before each one:
- Step 1 (request_xml): skipped if self.file already exists on disk and force_update=False
- Step 2 (request_pid): skipped if self.pid_provider_xml is already associated and force_update=False
This allows reprocessing only the missing step when an earlier one has already completed successfully, reducing rework and unnecessary requests.
…places domain with base_url
In ArticleIndex:
- Adds the fields issn (MultiValueField), license, aff_country, aff_institution, open_access, indexed_at (MultiValueField), and crossmark_active (BooleanField) with their respective prepare_* methods.
- Replaces collection.domain with collection.base_url when composing fulltext PDF and HTML URLs and identifiers, ensuring the correct protocol in indexed URLs.
In ArticleOAIIndex:
- Adds the fields issn, publisher, orcid, and format_ with prepare_* methods.
- Simplifies prepare_date by returning obj.pub_date directly instead of manually building a string from pub_date_year/month/day.
- Replaces collection.domain with collection.base_url in OAI identifier URLs.
…xml and adds return None Wraps the PidProviderXML.get_by_pid_v3 call in try/except to catch DoesNotExist explicitly, preventing the exception from propagating when the record simply does not exist. Adds return None at the end of the function to guarantee an explicit return on all error paths, preventing an undocumented implicit None return.
…_issue_from_articlemeta Standardizes the task name by removing the extra underscore between 'article' and 'meta', aligning it with the convention adopted by the other tasks in the project. Updates all references to the task name in calls to UnexpectedEvent.create (action fields) to reflect the new name.
… of articles Creates the ArticleIteratorBuilder class in controller.py, consolidating into a single iterable object the article selection logic scattered across the old orchestration tasks. Available iterators and their activating arguments:
- _iter_from_pid_provider: filters PidProviderXML by journal, date, and proc_status_list; the default when no exclusive argument is provided.
- _iter_from_article: filters Article by data_status_list, tries to recover a missing pp_xml via PidProviderXML.get_by_pid_v3, and emits None on failure (signaling a skip to the dispatcher).
- _iter_from_harvest: instantiates OPACHarvester (scl collection) or AMHarvester (other collections) via _build_harvester and iterates over the harvested documents.
- _iter_from_article_source: iterates over pending or failed ArticleSources via ArticleSource.get_queryset_to_complete_data.
The __iter__ method chains all active iterators simultaneously, allowing multiple sources to be processed in a single run. Removes unused imports (datetime, Q, date_utils, SciELOJournal, XMLVersionXmlWithPreError, PPXML_STATUS_DUPLICATED, etc.).
…icles and task_process_article_pipeline
Removes the following obsolete tasks:
- task_select_articles_to_complete_data
- task_select_articles_to_load_from_api
- task_select_articles_to_load_from_collection_endpoint
- task_load_article_from_xml_url
- task_select_articles_to_load_from_article_source
- task_load_articles
- task_load_journal_articles
- task_load_article_from_pp_xml
- task_fix_journal_articles_status
Introduces task_dispatch_articles:
- Unified orchestrator that delegates selection to ArticleIteratorBuilder.
- Accepts all common filters (collection, journal, dates, years) and the exclusive arguments that activate each iterator (proc_status_list, data_status_list, limit/timeout/opac_url, article_source_status_list).
- Counts dispatched and skipped items, returning a summary of the operation.
Introduces task_process_article_pipeline:
- Consolidates the three input flows into a single task:
  Flow A: xml_url + collection_acron + pid → AMArticle → ArticleSource → pp_xml_id
  Flow B: article_source_id → add_pid_provider → pp_xml_id
  Flow C: pp_xml_id directly → load_article
- After loading the article, exports it to ArticleMeta via task_export_article_to_articlemeta if export_to_articlemeta=True.
Adds detailed docstrings to the existing tasks that were kept: load_funding_data, load_preprint, task_convert_xml_to_other_formats_for_articles, convert_xml_to_other_formats, transfer_license_statements_fk_to_article_license, normalize_stored_email.
Removes unused imports (group, transaction, Count, F, Prefetch, Subquery, fetch_data, OPACHarvester, SciELOJournal, PidProvider, etc.).
…pands the list of obsolete tasks
In delete_outdated_tasks:
- Adds the removed tasks to the cleanup registry: task_select_articles_to_complete_data, task_select_articles_to_load_from_api, task_select_articles_to_load_from_collection_endpoint, task_select_articles_to_load_from_article_source, task_load_articles, task_load_journal_articles, task_load_article_from_xml_url, task_create_article_source, task_create_pid_provider_xml, task_fix_journal_articles_status, task_select_articles_to_export_to_articlemeta, and issue.tasks.load_issue_from_article_meta (with and without namespace).
In schedule_tasks:
- Calls delete_outdated_tasks at the start to ensure automatic cleanup.
- Replaces schedule_task_select_articles_to_complete_data, schedule_task_select_articles_to_load_from_api, schedule_task_select_articles_to_load_from_article_source, and schedule_task_load_articles with the new schedule_task_dispatch_articles.
- Replaces schedule_load_issue_from_article_meta with schedule_load_issue_from_articlemeta.
- Removes the call to schedule_bigbang_delete_outdated_tasks (replaced by the direct call to delete_outdated_tasks at the start of schedule_tasks).
In schedule_task_dispatch_articles:
- Configures the full kwargs of the new unified task with all filter and control parameters, scheduled to run daily at 02:01.
In schedule_task_export_articles_to_articlemeta:
- Fixes the task name from task_select_articles_to_export_to_articlemeta to task_export_articles_to_articlemeta.
In schedule_load_issue_from_articlemeta:
- Updates kwargs for the new signature of the renamed task, removing obsolete parameters (collection, issn_scielo, limit, reset) and setting timeout to 30.
Replaces the generic translation 'Editor' with 'Entidad Editora', a more precise term in the Spanish editorial context and aligned with the terminology used in the other Spanish-language SciELO interfaces.
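The step-skipping logic described above can be sketched as follows. `SourceSketch` is a hypothetical stand-in for ArticleSource, and it checks only whether `file` is set rather than whether the file exists on disk, so treat it as an illustration of the control flow under those assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceSketch:
    """Hypothetical stand-in for ArticleSource (not the project's model)."""

    file: Optional[str] = None
    pid_provider_xml: Optional[str] = None
    steps_run: List[str] = field(default_factory=list)

    def request_xml(self) -> None:
        # Placeholder for fetching and storing the XML.
        self.steps_run.append("request_xml")
        self.file = "article.xml"

    def request_pid(self) -> None:
        # Placeholder for registering the XML with the PID provider.
        self.steps_run.append("request_pid")
        self.pid_provider_xml = "pp_xml"

    def add_pid_provider(self, force_update: bool = False) -> List[str]:
        # Step 1: fetch the XML only if it is missing or an update is forced.
        if force_update or not self.file:
            self.request_xml()
        # Step 2: request the PID only if it is missing or an update is forced.
        if force_update or not self.pid_provider_xml:
            self.request_pid()
        return self.steps_run
```

A source that already has its file but no PID re-runs only the second step, which is the "reprocess only the missing step" behaviour the commit describes.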
Pull request overview
This PR extends the XML ingestion resiliency work by consolidating the article processing/orchestration flow (Celery) and persisting richer failure metadata (including preserving failing XML content), while also expanding journal/Crossmark support and improving i18n and Solr indexing.
Changes:
- Consolidates article ingestion orchestration into unified iterator + dispatch/pipeline tasks, and introduces/uses URL/XML error statuses to enable selective retries.
- Adds XMLURL persistence (including ZIP preservation of fetched XML) and refactors PID Provider lookup/dedup helpers.
- Expands journal/Crossmark policy admin + API exposure, improves Solr index fields, and introduces configurable HomePage footer sponsors + broader template i18n.
Reviewed changes
Copilot reviewed 51 out of 52 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| researcher/models.py | Lattes identifier extraction/normalization changes in ResearcherIds. |
| requirements/base.txt | Bumps packtools version. |
| pid_provider/wagtail_hooks.py | Updates CommonControlFieldViewSet import location. |
| pid_provider/test_models.py | Adds tests for new XMLURL behavior and BasePidProvider failure/success flows. |
| pid_provider/sources/harvesting.py | Uses new PID lookup helper and makes return behavior explicit. |
| pid_provider/models.py | Adds XMLURL model + zip storage, refactors PID lookup/dedup logic, extends file path limits. |
| pid_provider/migrations/0015_alter_xmlversion_file_xmlurl.py | Migration for XMLVersion file length and new XMLURL table. |
| pid_provider/base_pid_provider.py | Adds XMLURL-based failure persistence + refactors error handling paths. |
| organization/REFACTORING.md | Removes refactoring documentation file. |
| location/models.py | Minor __str__ / __unicode__ formatting changes. |
| locale/es/LC_MESSAGES/django.po | Updates Spanish translation for “Publisher”. |
| journalpage/templates/journalpage/journal_page.html | Adds {% trans %} for multiple static strings. |
| journalpage/templates/journalpage/includes/share.html | Translates “Imprimir”. |
| journalpage/templates/journalpage/includes/levelMenu.html | Translates menu labels/strings. |
| journalpage/templates/journalpage/includes/journal_info.html | Translates journal info labels. |
| journalpage/templates/journalpage/includes/header.html | Translates header dropdown entries. |
| journalpage/templates/journalpage/includes/footer.html | Adds i18n and blocktrans for footer text. |
| journalpage/templates/journalpage/includes/contact_footer.html | Adds i18n and translates contact/footer strings. |
| journalpage/templates/journalpage/base.html | Translates modal strings and “Reportar erro”. |
| journalpage/templates/journalpage/about.html | Translates “Atualizado”. |
| journal/wagtail_hooks.py | Adds CrossmarkPolicy admin (filtered by user permissions) + tweaks list displays/filters. |
| journal/tasks.py | Extends ArticleMeta journal load tasks to support journal_issn_list. |
| journal/sources/article_meta.py | Refactors journal fetch/store and allows targeted ISSN processing. |
| journal/proxys.py | Adds Crossmark policy inline panels to journal proxy editors/policy panels. |
| journal/models.py | Adds crossref_configuration FK, introduces CrossmarkPolicy model + related helpers, adjusts panels and logo URL logic. |
| journal/migrations/0058_remove_journal_doi_prefix_journal_crossref_configuration.py | Migrates legacy doi_prefix into CrossRefConfiguration FK then removes old field. |
| journal/migrations/0057_crossmarkpolicy.py | Creates CrossmarkPolicy model. |
| journal/choices.py | Adds UPDATE_POLICY_TYPE choices list. |
| journal/api/v1/views.py | Adds CrossmarkPolicy endpoint + optimizes queryset prefetching. |
| journal/api/v1/serializers.py | Adds CrossmarkPolicySerializer and exposes crossmark policies + doi_prefix via FK. |
| issue/test_format_code_sections.py | Adds unit tests ensuring toc-driven section formatting. |
| issue/tasks.py | Renames task function and updates UnexpectedEvent action strings. |
| issue/formats/articlemeta_format.py | Switches code section output source to table_of_contents. |
| docs/EDITORIAL_BOARD_FORM_IMPROVEMENTS.md | Removes documentation file. |
| core/viewsets.py | Removes CommonControlFieldViewSet from this module. |
| core/views.py | Moves/introduces CommonControlFieldViewSet into views module. |
| core/templates/home/scieloorg/footer.html | Renders footer sponsors dynamically with Wagtail images + fallback to static logos. |
| core/home/models.py | Adds HomePageSponsor orderable model + InlinePanel on HomePage. |
| core/home/migrations/0014_homepagesponsor.py | Creates HomePageSponsor table. |
| core/home/context_processors.py | Adds global context processor to provide footer_sponsors. |
| config/settings/base.py | Registers sponsors context processor; changes profiling logger handler. |
| config/api_router.py | Registers crossmarkpolicy API route. |
| compose/local/django/Dockerfile | Runs container as non-root django user with configurable UID. |
| collection/models.py | Adds base_url (protocol-safe) and normalizes stored domain. |
| bigbang/tasks_scheduler.py | Removes obsolete article task schedules and replaces with unified dispatch task; updates issue schedule name. |
| article/tasks.py | Introduces unified dispatch + pipeline task; removes fragmented selection/load tasks; expands docstrings. |
| article/sources/xmlsps.py | Uses new PID lookup helper and reduces repeated xml_with_pre access. |
| article/search_indexes.py | Adds new Solr fields and fixes URL building with protocol-safe collection.base_url. |
| article/models.py | Adds URL/XML error statuses, refactors ArticleSource pipeline with skip logic, adds autocomplete helpers. |
| article/migrations/0048_alter_articlesource_status.py | Adds URL_ERROR/XML_ERROR to ArticleSource status choices. |
| article/controller.py | Adds ArticleIteratorBuilder to unify selection logic across multiple sources. |
| def _handle_pid_provider_failure(self, response, xml_with_pre, xml_uri, name, user, origin_date, force_update, is_published): | ||
| """Handle exception type b) - XML obtained but PidProviderXML creation failed""" | ||
| # Format error information from response (not from an exception context) | ||
| error_msg = response.get("error_message", "Unknown error") | ||
| error_type = response.get("error_type", "Unknown") | ||
| error_info = f"{error_type}: {error_msg}" | ||
| truncated_error = _truncate_traceback(error_info) | ||
|
|
_handle_pid_provider_failure reads response["error_message"], but PidProviderXML.register() populates "error_msg" (and not "error_message") on exceptions. As written, XMLURL.exceptions will frequently record "Unknown error" and lose the real failure detail. Read from "error_msg" (and/or support both keys) when formatting the stored error.
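One way to support both keys, as the comment suggests, is sketched below. The function name `format_registration_error` is hypothetical, and the response dictionaries in the usage lines are illustrative, not real output from PidProviderXML.register().

```python
def format_registration_error(response: dict) -> str:
    """Build the stored error string from a registration response.

    Hypothetical helper: prefers "error_msg" and falls back to
    "error_message", so either key spelling is preserved.
    """
    error_msg = (
        response.get("error_msg")
        or response.get("error_message")
        or "Unknown error"
    )
    error_type = response.get("error_type") or "Unknown"
    return f"{error_type}: {error_msg}"


# Illustrative responses only.
with_short_key = format_registration_error(
    {"error_type": "ValueError", "error_msg": "invalid XML"}
)
with_long_key = format_registration_error({"error_message": "timeout"})
```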
| {% if loop.last %} | ||
| <span class="text">Siga este periódico nas redes sociais</span> | ||
| <span class="text">{% trans 'Siga este periódico nas redes sociais' %}</span> | ||
| {% endif %} |
This template uses {% if loop.last %}, but Django templates use forloop.last ("loop" is a Jinja variable). As-is, this will raise a TemplateSyntaxError / never render as intended. Replace with {% if forloop.last %}.
| elif self.source_name == "LATTES" and self.identifier: | ||
| clean_value = self.extract_lattes(self.identifier) | ||
| self.identifier = clean_value | ||
| super().save(**kwargs) |
ResearcherIds.save() normalizes LATTES identifiers via extract_lattes() but does not validate the normalized value before saving. Since model.save() may be called without full_clean(), this can persist invalid identifiers (e.g., URLs without a 16-digit ID). Call validate_lattes(clean_value) (or run full_clean()) before assigning/saving, similar to the EMAIL path.
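A sketch of the extract-then-validate order the comment asks for. The bodies of `extract_lattes` and `validate_lattes` below are hypothetical stand-ins that reuse the names mentioned in the review; the project's real helpers may behave differently, and a Django model would more likely raise ValidationError than ValueError.

```python
import re

# A Lattes ID is treated here as exactly 16 digits, per the review comment.
_LATTES_RUN = re.compile(r"(?<!\d)\d{16}(?!\d)")


def extract_lattes(value: str) -> str:
    """Hypothetical: return the 16-digit run if present, else the stripped input."""
    match = _LATTES_RUN.search(value)
    return match.group(0) if match else value.strip()


def validate_lattes(value: str) -> None:
    """Hypothetical: reject anything that is not exactly 16 digits."""
    if not re.fullmatch(r"\d{16}", value):
        raise ValueError(f"Invalid Lattes identifier: {value!r}")


def normalize_lattes(identifier: str) -> str:
    """Normalize first, then validate the normalized value before it is saved."""
    clean_value = extract_lattes(identifier)
    validate_lattes(clean_value)
    return clean_value
```

Calling `normalize_lattes` inside save() before assigning `self.identifier` would keep a URL without an ID from being persisted.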
| yield from self._iter_from_harvest() | ||
| yield from self._iter_from_article_source() | ||
| yield from self._iter_from_pid_provider() | ||
| yield from self._iter_from_article() | ||
|
|
ArticleIteratorBuilder.__iter__ always yields from _iter_from_harvest() and _iter_from_article_source() regardless of which selector args were provided. This contradicts the class docstring (which says these iterators should only be active when their exclusive args are present) and can cause an unexpected full harvest of all collections + extra dispatch duplicates on every run. Gate each iterator behind the intended activation condition (or compute an explicit list of active iterators) so the default path only iterates from PidProviderXML.
| yield from self._iter_from_harvest() | |
| yield from self._iter_from_article_source() | |
| yield from self._iter_from_pid_provider() | |
| yield from self._iter_from_article() | |
| """ | |
| Seleciona o iterador apropriado de acordo com os argumentos | |
| exclusivos fornecidos no construtor, conforme documentado na | |
| tabela da docstring da classe. | |
| Prioridade / mapeamento: | |
| - opac_url -> _iter_from_harvest | |
| - article_source_status_list -> _iter_from_article_source | |
| - data_status_list -> _iter_from_article | |
| - nenhum -> _iter_from_pid_provider (padrão) | |
| """ | |
| if self.opac_url: | |
| yield from self._iter_from_harvest() | |
| elif self.article_source_status_list: | |
| yield from self._iter_from_article_source() | |
| elif self.data_status_list: | |
| yield from self._iter_from_article() | |
| else: | |
| yield from self._iter_from_pid_provider() |
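The suggestion above picks exactly one iterator. An alternative, matching the commit description that __iter__ chains all active iterators, is to compute the list of active iterators explicitly. The sketch below assumes a hypothetical class `IteratorBuilderSketch` whose iterators yield placeholder strings; how proc_status_list should interact with the other arguments is not settled by the source and is left out.

```python
from itertools import chain


class IteratorBuilderSketch:
    """Hypothetical stand-in for ArticleIteratorBuilder."""

    def __init__(self, opac_url=None, article_source_status_list=None,
                 data_status_list=None):
        self.opac_url = opac_url
        self.article_source_status_list = article_source_status_list
        self.data_status_list = data_status_list

    # Placeholder iterators: the real ones would yield documents or records.
    def _iter_from_harvest(self):
        yield "harvest"

    def _iter_from_article_source(self):
        yield "article_source"

    def _iter_from_article(self):
        yield "article"

    def _iter_from_pid_provider(self):
        yield "pid_provider"

    def _active_iterators(self):
        # Each iterator is gated behind its own activating argument.
        active = []
        if self.opac_url:
            active.append(self._iter_from_harvest)
        if self.article_source_status_list:
            active.append(self._iter_from_article_source)
        if self.data_status_list:
            active.append(self._iter_from_article)
        if not active:
            # Default path: only PidProviderXML when nothing else is requested.
            active.append(self._iter_from_pid_provider)
        return active

    def __iter__(self):
        # Chain every active iterator lazily, in a fixed order.
        return chain.from_iterable(it() for it in self._active_iterators())
```

With no exclusive arguments the builder iterates only from the default source, and passing several arguments still processes multiple sources in one run.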
| article_source = ArticleSource.create_or_update( | ||
| user=user, | ||
| url=xml_url, | ||
| source_date=source_date, | ||
| force_update=force_update, | ||
| am_article=am_article, | ||
| auto_solve_pid_conflict=auto_solve_pid_conflict, | ||
| ) | ||
| return { | ||
| "status": "success", | ||
| "message": "Processing all articles without journal filters", | ||
| "filters": { | ||
| "from_pub_year": from_pub_year, | ||
| "until_pub_year": until_pub_year, | ||
| "from_updated_date": from_updated_date, | ||
| "until_updated_date": until_updated_date, | ||
| "proc_status_list": proc_status_list, | ||
| }, | ||
| } | ||
| pp_xml_id = article_source.pid_provider_xml.id | ||
|
|
||
| if article_source_id: | ||
| article_source = ArticleSource.objects.get(id=article_source_id) | ||
| article_source.add_pid_provider( | ||
| user=user, | ||
| force_update=force_update, | ||
| auto_solve_pid_conflict=auto_solve_pid_conflict, | ||
| ) | ||
| pp_xml_id = article_source.pid_provider_xml.id | ||
|
|
task_process_article_pipeline assumes ArticleSource.create_or_update()/add_pid_provider always sets pid_provider_xml and then unconditionally accesses article_source.pid_provider_xml.id. But add_pid_provider swallows URL/XML errors and may leave pid_provider_xml unset (status url_error/xml_error), which will raise AttributeError here and turn a handled ingestion failure into a task crash. After creating/updating the ArticleSource, check for pid_provider_xml (and/or is_completed/status) and return a structured error/skip result instead of dereferencing blindly.
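A minimal sketch of the guard this comment describes, assuming a hypothetical helper `resolve_pp_xml_id` and an illustrative result shape; the keys and status strings returned here are not the task's actual contract.

```python
from types import SimpleNamespace


def resolve_pp_xml_id(article_source) -> dict:
    """Return a structured result instead of dereferencing blindly.

    Hypothetical helper: `article_source` only needs `pid_provider_xml`
    and `status` attributes for this sketch.
    """
    pp_xml = getattr(article_source, "pid_provider_xml", None)
    if pp_xml is None:
        # The ingestion failure was already handled upstream (for example
        # url_error or xml_error), so report a skip rather than crash.
        return {
            "status": "skipped",
            "reason": getattr(article_source, "status", None),
            "pp_xml_id": None,
        }
    return {"status": "ok", "reason": None, "pp_xml_id": pp_xml.id}


# SimpleNamespace objects stand in for ArticleSource instances.
failed = SimpleNamespace(pid_provider_xml=None, status="xml_error")
completed = SimpleNamespace(pid_provider_xml=SimpleNamespace(id=42), status="completed")
```

The task can then return the skipped result to the dispatcher, which already counts skipped items.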
    url = models.URLField(
        _("URL"), max_length=500, null=False, blank=False
    )
    status = models.CharField(
        _("Status"), max_length=50, null=True, blank=True
    )
    pid = models.CharField(
        _("Article PID"), max_length=23, null=True, blank=True
    )
    zipfile = models.FileField(
        _("ZIP File"), upload_to=xml_url_zipfile_path, null=True, blank=True, max_length=300,
    )
    exceptions = models.CharField(
        _("Exceptions"), max_length=255, null=True, blank=True
    )

    base_form_class = CoreAdminModelForm

    panels = [
        FieldPanel("url"),
        FieldPanel("status"),
        FieldPanel("pid"),
        FieldPanel("zipfile"),
        FieldPanel("exceptions"),
    ]

    class Meta:
        ordering = ["-updated", "-created"]
        verbose_name = _("XML URL")
        verbose_name_plural = _("XML URLs")

        indexes = [
            models.Index(fields=["url"]),
            models.Index(fields=["status"]),
            models.Index(fields=["pid"]),
        ]

    def __str__(self):
        return f"{self.url} - {self.status}"

    @classmethod
    def get(cls, url=None):
        if url:
            return cls.objects.get(url=url)
        raise ValueError("XMLURL.get() requires a url parameter")

    @classmethod
    def create(
        cls,
        user,
        url=None,
        status=None,
        pid=None,
        exceptions=None,
    ):
        try:
            obj = cls()
            obj.url = url
            obj.status = status
            obj.pid = pid
            obj.exceptions = exceptions
            obj.creator = user
            obj.save()
            return obj
        except IntegrityError:
            return cls.get(url)

    @classmethod
    def create_or_update(
        cls,
        user,
        url=None,
        status=None,
        pid=None,
        exceptions=None,
    ):
        try:
            obj = cls.get(url=url)
            obj.updated_by = user
            if status is not None:
                obj.status = status
            if pid is not None:
                obj.pid = pid
            if exceptions is not None:
                obj.exceptions = exceptions
            obj.save()
            return obj
        except cls.DoesNotExist:
            return cls.create(
                user,
                url,
                status,
                pid,
                exceptions,
            )
XMLURL.create()/create_or_update() assumes URL collisions will raise IntegrityError and that URL is unique, but the model/migration do not define url as unique. This can silently create duplicates; later create_or_update() will start raising MultipleObjectsReturned from XMLURL.get(url=...). Add a uniqueness constraint for url (or implement deterministic upsert logic that can handle duplicates) and adjust create() accordingly.
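To make the point concrete, here is a database-level sketch of what a uniqueness constraint plus a deterministic upsert looks like. It uses the standard-library `sqlite3` module rather than the Django ORM, so the table layout and the `make_db` / `create_or_update` helpers are assumptions for illustration only; it also assumes an SQLite build with `ON CONFLICT ... DO UPDATE` support (3.24 or later). In the model itself, the analogous first step would be declaring the `url` field with `unique=True` and adding the corresponding migration.

```python
import sqlite3


def make_db():
    """Create an in-memory table that mirrors the XMLURL columns (assumed layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE xmlurl ("
        " id INTEGER PRIMARY KEY,"
        " url TEXT NOT NULL UNIQUE,"  # the constraint the review asks for
        " status TEXT,"
        " pid TEXT,"
        " exceptions TEXT)"
    )
    return conn


def create_or_update(conn, url, status=None, pid=None, exceptions=None):
    """Insert the row, or update it in place when the URL already exists.

    None arguments leave the stored value untouched, mirroring the
    'if status is not None' checks in the reviewed method.
    """
    conn.execute(
        "INSERT INTO xmlurl (url, status, pid, exceptions) VALUES (?, ?, ?, ?) "
        "ON CONFLICT(url) DO UPDATE SET "
        " status = COALESCE(excluded.status, status),"
        " pid = COALESCE(excluded.pid, pid),"
        " exceptions = COALESCE(excluded.exceptions, exceptions)",
        (url, status, pid, exceptions),
    )
    conn.commit()
    return conn.execute(
        "SELECT url, status, pid, exceptions FROM xmlurl WHERE url = ?", (url,)
    ).fetchone()
```

With the constraint in place, a second call for the same URL can only ever touch the single existing row, so a later lookup by URL cannot return multiple objects.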
    """
    # Use URL hash to create a unique subdirectory
    url_hash = abs(hash(instance.url)) % (10 ** 8)
    return f"pid_provider/xmlurl/{url_hash}/{filename}"
xml_url_zipfile_path uses Python's built-in hash(instance.url) to build the upload path. Python hash randomization means the same URL can map to different directories across processes/restarts, making paths non-deterministic and harder to debug/migrate. Use a stable hash (e.g., sha1/md5 of the URL string) or derive a safe slug instead.
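One way to get a stable directory name is sketched below, assuming a truncated SHA-1 hex digest of the URL is acceptable as the subdirectory. The helper names `stable_url_key` and `zipfile_path_for` and the 8-character length are illustrative choices, not taken from the PR; the path prefix is the one from the reviewed function.

```python
import hashlib


def stable_url_key(url, length=8):
    """Return a short hex key that is identical across processes and restarts."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return digest[:length]


def zipfile_path_for(url, filename):
    """Build the upload path from the stable key instead of the built-in hash()."""
    return f"pid_provider/xmlurl/{stable_url_key(url)}/{filename}"
```

Unlike the built-in `hash()`, whose value for strings varies between interpreter runs unless `PYTHONHASHSEED` is fixed, the digest depends only on the URL bytes, so the same URL always maps to the same directory.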
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Closed because it apparently contains changes from other PRs.
What does this PR do?
This PR introduces robust error handling in the XML-based article ingestion pipeline (RequestXMLException, the URL_ERROR/XML_ERROR statuses, the XMLURL model, and the deduplication refactoring). The present PR builds on that foundation and consolidates the article processing pipeline, eliminating the fragmentation of the Celery orchestration tasks.

Problems solved (this PR):
- [...] XMLURL.
- [...] pkg_name was scattered; it has been centralized in PidProviderXML.
- [...] core/views.py.
- [...] task_select_articles_to_complete_data, task_select_articles_to_load_from_api, task_select_articles_to_load_from_collection_endpoint, task_load_articles, task_load_journal_articles, task_select_articles_to_load_from_article_source), each reimplementing the collection, journal, and date filters independently.
- [...] request_xml and request_pid in add_pid_provider were re-executed unnecessarily even when they had already completed successfully; they are now integrated with the new statuses from the previous PR to decide when to skip.
- [...] PidProviderXML.DoesNotExist not handled explicitly in is_pp_xml_valid, provide_pid_for_opac_and_am_xml, and _iter_from_article.
- [...] collection.domain directly in the Solr indexes.

Features added (this PR):
- ArticleIteratorBuilder in controller.py: wraps the four article selection modes (pid_provider, article, harvest, article_source) in a single iterable object. The modes can be enabled simultaneously through arguments, and the builder uses the new error statuses introduced in the previous PR to filter ArticleSource precisely.
- task_dispatch_articles: unified orchestrator that delegates selection to ArticleIteratorBuilder and triggers task_process_article_pipeline for each item.
- task_process_article_pipeline: consolidates the three input flows (XML URL → ArticleSource → PidProviderXML → Article; existing ArticleSource; PidProviderXML directly) into a single atomic task.
- cached_property base_url in Collection: returns the domain with the https:// protocol added when it is missing, fixing URLs in the Solr indexes.

Where could the review start?
We recommend following the order below, which respects the dependency chain between the two PRs:
1. pid_provider/models.py (XMLURL model and deduplication refactoring, the basis of the previous PR): understand how failures are persisted and how duplicates are identified.
2. pid_provider/sources/harvesting.py (added DoesNotExist handling and return None): the integration point between the two PRs.
3. article/models.py (add_pid_provider method with state checks has_valid_file and has_pid_provider, and is_pp_xml_valid with DoesNotExist handled): confirm that the new URL_ERROR/XML_ERROR statuses from the previous PR are correctly taken into account in the skip decisions.
4. article/controller.py (ArticleIteratorBuilder class): understand the four iterators and the chaining logic in __iter__.
5. article/tasks.py (task_dispatch_articles and task_process_article_pipeline functions): check that the three input flows of the pipeline are correct and that the error handling covers all cases.

How could this be tested manually?
Prerequisites: a local environment with Docker, a database populated with at least one PidProviderXML with proc_status = todo, and a registered collection.

Scenario 1: XML failure with preservation in ZIP (previous PR):
Scenario 2: partial reprocessing without repeating a completed step:
Scenario 3: default flow via pid_provider:
Scenario 4: multiple simultaneous sources:
Scenario 5: deduplication (previous PR):
Scenario 6: verification of the Solr indexes:
Scenario 7: cleanup of obsolete tasks:
Automated tests:
Any context you want to provide?
The article pipeline grew organically over multiple development cycles. The previous PR addressed the resilience layer, ensuring that network failures and XML errors were recorded, preserved, and recoverable. This PR addresses the orchestration layer, which had accumulated six tasks with overlapping responsibilities, each reimplementing the same filters independently.

The integration between the two PRs is direct: add_pid_provider now decides whether to skip request_xml or request_pid based on the state of the object, which in turn may have been marked as URL_ERROR or XML_ERROR by the mechanism introduced in the previous PR. ArticleIteratorBuilder uses ArticleSource.get_queryset_to_complete_data with the new statuses to select only the records that actually need reprocessing.

The bug of URLs without a protocol in the Solr indexes was silent: records were indexed with www.scielo.br/scielo.php?... instead of https://www.scielo.br/scielo.php?..., causing failures only when the search client used them.

Screenshots
Not applicable: the changes affect Celery tasks, models, and Solr indexes, with no direct impact on the graphical interface.

What are the relevant tickets?
[...] XMLURL, deduplication, centralization of Viewsets).

References