{"id":2859,"date":"2024-02-18T15:29:24","date_gmt":"2024-02-18T15:29:24","guid":{"rendered":"https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/"},"modified":"2024-02-18T15:29:24","modified_gmt":"2024-02-18T15:29:24","slug":"efficient-website-data-scraping-for-improved-data-management","status":"publish","type":"resource","link":"https:\/\/esisoc.com\/pt\/resource\/recolha-eficiente-de-dados-de-sitios-web-para-uma-melhor-gestao-dos-dados\/","title":{"rendered":"Efficient Website Data Scraping for Improved Data Management"},"content":{"rendered":"<h2 style=\"text-align: center;\">Key Details<\/h2>\n<p>Accessing multiple data sources with data scraping.<\/p>\n<div>\n<ul>\n<li>\n<div>Challenge<\/div>\n<div>Fast and accurate data scraping from multiple sources<\/div>\n<\/li>\n<li>\n<div>Solution<\/div>\n<div>Best practices for robust and resilient web scraping<\/div>\n<\/li>\n<li>\n<div>Technologies and tools<\/div>\n<div>Microsoft Azure Cloud Services for infrastructure hosting, tuning, and administration. Python with the required libraries and frameworks (Azure-sdk, Scrapy, Selenium, etc.) for website scraping and crawling<\/div>\n<\/li>\n<\/ul>\n<\/div>\n<h2 style=\"text-align: center;\">Client<\/h2>\n<p>The client is a non-commercial organization that supports small businesses and African American entrepreneurs. It takes pride in providing services that help African American entrepreneurs obtain grants and win tenders.<\/p>\n<h2 style=\"text-align: center;\">Challenge: fast and accurate data scraping from multiple sources<\/h2>\n<p>The client regularly handles large volumes of data from multiple sources. 
As a result, data management had become a concern for them.<\/p>\n<p>They wanted to collect job openings, mentorship, and networking opportunities for talented African American entrepreneurs from various websites and publish them on their own platform. That way, entrepreneurs can easily discover African American-owned businesses and support them, or start a business of their own.<\/p>\n<p>ESSID Solutions was challenged to develop a robust data scraping solution for the client's marketplace.<\/p>\n<h2 style=\"text-align: center;\">Solution: best practices for robust and resilient web scraping<\/h2>\n<p>Our engineering team applied its data scraping experience to enable efficient data collection from multiple sources.<\/p>\n<p><a href=\"https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-scheme.png\" rel=\"noopener\" target=\"_blank\"><img alt=\"Website data scraping solution diagram\" decoding=\"async\" height=\"1716\" loading=\"lazy\" sizes=\"(max-width: 1200px) 100vw, 1200px\" src=\"https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-scheme.png\" srcset=\"https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-scheme.png 1200w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-scheme-210x300.png 210w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-scheme-716x1024.png 716w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-scheme-768x1098.png 768w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-scheme-1074x1536.png 1074w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-scheme-519x742.png 
519w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-scheme-448x640.png 448w\" width=\"1200\"\/><\/a><\/p>\n<p>The ESSID Solutions team had to set up the infrastructure and code flow for the client:<\/p>\n<ol>\n<li>\n<h3>Git and CI\/CD part<\/h3>\n<p>For code management, we used an Azure DevOps repository with a pipeline configuration that let our team build and push Docker images to the registry using a parallel job agent.<\/li>\n<li>\n<h3>Registry and Logic App part<\/h3>\n<p>Next, we created an Azure Container Registry in <a href=\"http:\/\/localhost\/essidsolutions\/service\/azure-data-analytics-services\">Azure<\/a> to store our Docker images.<\/p>\n<p>We then needed to create container instances from those images using Azure Logic Apps, so that the scraper code could run in parallel and in isolation.<\/li>\n<li>\n<h3>Scraper part<\/h3>\n<p>During this stage, the ESSID Solutions team created container instances with Logic Apps. 
We then needed to give each container access to Azure resources and to sensitive data such as passwords and connection strings, which were stored in Azure Key Vault.<\/p>\n<p><img alt=\"Scraper part design\" decoding=\"async\" height=\"301\" loading=\"lazy\" sizes=\"(max-width: 871px) 100vw, 871px\" src=\"https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-write.png\" srcset=\"https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-write.png 871w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-write-300x104.png 300w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-write-768x265.png 768w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-write-742x256.png 742w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-write-640x221.png 640w\" width=\"871\"\/><\/p>\n<p>To store the scraper results, our team created a storage account, which acts like a folder in the cloud for the scraped data. 
After that, we could start our scrapers manually, but we still needed orchestration, automation, and post-processing.<\/li>\n<li>\n<h3>Data Factory and orchestration part<\/h3>\n<p>Our engineers ran all of the scrapers on a time trigger within a single pipeline in Azure Data Factory.<\/p>\n<p>The main pipeline had to start all containers through Azure API requests and then run <a href=\"http:\/\/localhost\/essidsolutions\/service\/databricks-managed-services\">Databricks<\/a> notebooks to process the scraped data.<\/p>\n<p><img alt=\"Orchestration part design\" decoding=\"async\" height=\"305\" loading=\"lazy\" sizes=\"(max-width: 757px) 100vw, 757px\" src=\"https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-run.png\" srcset=\"https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-run.png 757w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-run-300x121.png 300w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-run-742x299.png 742w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-run-640x258.png 640w\" width=\"757\"\/><\/li>\n<li>\n<h3>Databricks<\/h3>\n<p>At this stage, we scraped the full data set from the websites on every run (incremental loading from websites is difficult or impossible) and processed and wrote all of it to the database. 
Before loading new data into the database, we deleted the existing data.<\/p>\n<p>As a result, the client has a robust data scraping solution that pulls data from multiple websites and business directories and gathers information on African American-founded businesses that is useful to subscribers of the client's platform.<\/li>\n<\/ol>\n<h2 style=\"text-align: center;\">Result: optimized data scraping to reduce processing time<\/h2>\n<p>Our team of data scientists and engineers drew on multiple sources to meet the client's data scraping needs.<\/p>\n<p>Our solution made the client more capable in the following ways:<\/p>\n<ul>\n<li><a href=\"https:\/\/essidsolutions.com\/data-extraction\">Data extraction at scale<\/a><\/li>\n<li>Structured data delivery<\/li>\n<li>Low maintenance and speed<\/li>\n<li>Easy to deploy<\/li>\n<li>Automation<\/li>\n<\/ul>","protected":false},"excerpt":{"rendered":"<p>Key Details Accessing multiple data sources with data scraping. Challenge Fast and accurate data scraping from multiple sources Solution Best practices for robust &amp; resilient web scraping Technologies and tools Microsoft Azure Cloud Services for infrastructure hosting, tuning and administration. Python language with required libraries and frameworks (Azure-sdk, Scrapy, Selenium, etc.) for web sites scraping ... 
Read more <a title=\"Efficient Website Data Scraping for Improved Data Management\" class=\"read-more\" href=\"https:\/\/esisoc.com\/pt\/resource\/recolha-eficiente-de-dados-de-sitios-web-para-uma-melhor-gestao-dos-dados\/\" aria-label=\"Read more about Efficient Website Data Scraping for Improved Data Management\">Read more<\/a><\/p>","protected":false},"featured_media":2860,"template":"","industry":[77,70],"expertise":[74,65,78,72,58],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v21.9 (Yoast SEO v21.9.1) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Efficient Website Data Scraping for Improved Data Management - ESISOC | ESSID Solutions<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/esisoc.com\/pt\/resource\/recolha-eficiente-de-dados-de-sitios-web-para-uma-melhor-gestao-dos-dados\/\" \/>\n<meta property=\"og:locale\" content=\"pt_PT\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Efficient Website Data Scraping for Improved Data Management\" \/>\n<meta property=\"og:description\" content=\"Key Details Accessing multiple data sources with data scraping. Challenge Fast and accurate data scraping from multiple sources Solution Best practices for robust &amp; resilient web scraping Technologies and tools Microsoft Azure Cloud Services for infrastructure hosting, tuning and administration. Python language with required libraries and frameworks (Azure-sdk, Scrapy, Selenium, etc.) for web sites scraping ... 
Ler mais\" \/>\n<meta property=\"og:url\" content=\"https:\/\/esisoc.com\/pt\/resource\/recolha-eficiente-de-dados-de-sitios-web-para-uma-melhor-gestao-dos-dados\/\" \/>\n<meta property=\"og:site_name\" content=\"ESISOC | ESSID Solutions\" \/>\n<meta property=\"og:image\" content=\"https:\/\/esisoc.com\/wp-content\/uploads\/2024\/02\/83ede7fb50b04acc8e2536d6b92b7761.webp\" \/>\n\t<meta property=\"og:image:width\" content=\"839\" \/>\n\t<meta property=\"og:image:height\" content=\"514\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/webp\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Tempo estimado de leitura\" \/>\n\t<meta name=\"twitter:data1\" content=\"3 minutos\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/\",\"url\":\"https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/\",\"name\":\"Efficient Website Data Scraping for Improved Data Management - ESISOC | ESSID 
Solutions\",\"isPartOf\":{\"@id\":\"https:\/\/esisoc.com\/#website\"},\"datePublished\":\"2024-02-18T15:29:24+00:00\",\"dateModified\":\"2024-02-18T15:29:24+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/#breadcrumb\"},\"inLanguage\":\"pt-PT\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/esisoc.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Efficient Website Data Scraping for Improved Data Management\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/esisoc.com\/#website\",\"url\":\"https:\/\/esisoc.com\/\",\"name\":\"ESISOC | ESSID Solutions\",\"description\":\"Data Science Consulting and AI | Online Books, Videos, Courses and more\",\"publisher\":{\"@id\":\"https:\/\/esisoc.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/esisoc.com\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"pt-PT\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/esisoc.com\/#organization\",\"name\":\"ESISOC | ESSID Solutions\",\"url\":\"https:\/\/esisoc.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"pt-PT\",\"@id\":\"https:\/\/esisoc.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/esisoc.com\/wp-content\/uploads\/2024\/02\/logo-esisoc.png\",\"contentUrl\":\"https:\/\/esisoc.com\/wp-content\/uploads\/2024\/02\/logo-esisoc.png\",\"width\":350,\"height\":63,\"caption\":\"ESISOC | ESSID Solutions\"},\"image\":{\"@id\":\"https:\/\/esisoc.com\/#\/schema\/logo\/image\/\"}}]}<\/script>\n<!-- \/ 
Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Efficient Website Data Scraping for Improved Data Management - ESISOC | ESSID Solutions","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/esisoc.com\/pt\/resource\/recolha-eficiente-de-dados-de-sitios-web-para-uma-melhor-gestao-dos-dados\/","og_locale":"pt_PT","og_type":"article","og_title":"Efficient Website Data Scraping for Improved Data Management","og_description":"Key Details Accessing multiple data sources with data scraping. Challenge Fast and accurate data scraping from multiple sources Solution Best practices for robust &amp; resilient web scraping Technologies and tools Microsoft Azure Cloud Services for infrastructure hosting, tuning and administration. Python language with required libraries and frameworks (Azure-sdk, Scrapy, Selenium, etc.) for web sites scraping ... Ler mais","og_url":"https:\/\/esisoc.com\/pt\/resource\/recolha-eficiente-de-dados-de-sitios-web-para-uma-melhor-gestao-dos-dados\/","og_site_name":"ESISOC | ESSID Solutions","og_image":[{"width":839,"height":514,"url":"https:\/\/esisoc.com\/wp-content\/uploads\/2024\/02\/83ede7fb50b04acc8e2536d6b92b7761.webp","type":"image\/webp"}],"twitter_card":"summary_large_image","twitter_misc":{"Tempo estimado de leitura":"3 minutos"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/","url":"https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/","name":"Efficient Website Data Scraping for Improved Data Management - ESISOC | ESSID 
Solutions","isPartOf":{"@id":"https:\/\/esisoc.com\/#website"},"datePublished":"2024-02-18T15:29:24+00:00","dateModified":"2024-02-18T15:29:24+00:00","breadcrumb":{"@id":"https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/#breadcrumb"},"inLanguage":"pt-PT","potentialAction":[{"@type":"ReadAction","target":["https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/esisoc.com\/"},{"@type":"ListItem","position":2,"name":"Efficient Website Data Scraping for Improved Data Management"}]},{"@type":"WebSite","@id":"https:\/\/esisoc.com\/#website","url":"https:\/\/esisoc.com\/","name":"ESISOC | ESSID Solutions","description":"Data Science Consulting and AI | Online Books, Videos, Courses and more","publisher":{"@id":"https:\/\/esisoc.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/esisoc.com\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"pt-PT"},{"@type":"Organization","@id":"https:\/\/esisoc.com\/#organization","name":"ESISOC | ESSID Solutions","url":"https:\/\/esisoc.com\/","logo":{"@type":"ImageObject","inLanguage":"pt-PT","@id":"https:\/\/esisoc.com\/#\/schema\/logo\/image\/","url":"https:\/\/esisoc.com\/wp-content\/uploads\/2024\/02\/logo-esisoc.png","contentUrl":"https:\/\/esisoc.com\/wp-content\/uploads\/2024\/02\/logo-esisoc.png","width":350,"height":63,"caption":"ESISOC | ESSID 
Solutions"},"image":{"@id":"https:\/\/esisoc.com\/#\/schema\/logo\/image\/"}}]}},"_links":{"self":[{"href":"https:\/\/esisoc.com\/pt\/wp-json\/wp\/v2\/resource\/2859"}],"collection":[{"href":"https:\/\/esisoc.com\/pt\/wp-json\/wp\/v2\/resource"}],"about":[{"href":"https:\/\/esisoc.com\/pt\/wp-json\/wp\/v2\/types\/resource"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/esisoc.com\/pt\/wp-json\/wp\/v2\/media\/2860"}],"wp:attachment":[{"href":"https:\/\/esisoc.com\/pt\/wp-json\/wp\/v2\/media?parent=2859"}],"wp:term":[{"taxonomy":"industry","embeddable":true,"href":"https:\/\/esisoc.com\/pt\/wp-json\/wp\/v2\/industry?post=2859"},{"taxonomy":"expertise","embeddable":true,"href":"https:\/\/esisoc.com\/pt\/wp-json\/wp\/v2\/expertise?post=2859"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}