{"id":2859,"date":"2024-02-18T15:29:24","date_gmt":"2024-02-18T15:29:24","guid":{"rendered":"https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/"},"modified":"2024-02-18T15:29:24","modified_gmt":"2024-02-18T15:29:24","slug":"efficient-website-data-scraping-for-improved-data-management","status":"publish","type":"resource","link":"https:\/\/esisoc.com\/fr\/resource\/lextraction-efficace-des-donnees-dun-site-web-pour-une-meilleure-gestion-des-donnees\/","title":{"rendered":"Efficient Website Data Scraping for Improved Data Management"},"content":{"rendered":"<h2 style=\"text-align: center;\">Key Details<\/h2>\n<p>Accessing multiple data sources with data scraping.<\/p>\n<div>\n<ul>\n<li>\n<div>Challenge<\/div>\n<div>Fast and accurate data scraping from multiple sources<\/div>\n<\/li>\n<li>\n<div>Solution<\/div>\n<div>Best practices for robust and resilient web scraping<\/div>\n<\/li>\n<li>\n<div>Technologies and tools<\/div>\n<div>Microsoft Azure Cloud Services for infrastructure hosting, tuning and administration. Python with the required libraries and frameworks (Azure-sdk, Scrapy, Selenium, etc.) for website scraping and crawling.<\/div>\n<\/li>\n<\/ul>\n<\/div>\n<h2 style=\"text-align: center;\">Client<\/h2>\n<p>The client is a non-commercial organization that supports African American small businesses and entrepreneurs. 
It is proud to provide services that help African American businesspeople obtain grants and succeed in competitions.<\/p>\n<h2 style=\"text-align: center;\">Challenge: fast and accurate data scraping from multiple sources<\/h2>\n<p>The client regularly processes huge amounts of data from various sources, so data management had become a concern.<\/p>\n<p>They wanted to collect job postings and mentoring and networking opportunities for talented African American entrepreneurs from different websites and publish them on their own platform. Entrepreneurs could then easily discover African American-owned businesses and support them, or start a business of their own.<\/p>\n<p>ESSID Solutions was challenged to develop a robust data scraping solution for the client's marketplace.<\/p>\n<h2 style=\"text-align: center;\">Solution: best practices for robust and resilient web scraping<\/h2>\n<p>Our engineering team drew on its data scraping expertise to enable efficient data collection from various sources.<\/p>\n<p><a href=\"https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-scheme.png\" rel=\"noopener\" target=\"_blank\"><img alt=\"Diagram of a website data scraping solution\" decoding=\"async\" height=\"1716\" loading=\"lazy\" sizes=\"(max-width: 1200px) 100vw, 1200px\" src=\"https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-scheme.png\" srcset=\"https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-scheme.png 1200w, 
https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-scheme-210x300.png 210w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-scheme-716x1024.png 716w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-scheme-768x1098.png 768w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-scheme-1074x1536.png 1074w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-scheme-519x742.png 519w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-scheme-448x640.png 448w\" width=\"1200\"\/><\/a><\/p>\n<p>The ESSID Solutions team had to set up the infrastructure and code flow for the client:<\/p>\n<ol>\n<li>\n<h3>Git and CI\/CD part<\/h3>\n<p>For code management, we used an AzureDevOps repository with a pipeline configuration that let our team build Docker images and push them to the registry using a parallel job agent.<\/li>\n<li>\n<h3>Registry and Logic App part<\/h3>\n<p>Next, we created an Azure Docker Container Registry on <a href=\"http:\/\/localhost\/essidsolutions\/service\/azure-data-analytics-services\">Azure<\/a> to store our Docker images.<\/p>\n<p>Then we had to create Docker instances from those images using Azure Logic App so that the scraper code could run in parallel and in isolation.<\/li>\n<li>\n<h3>Scraping part<\/h3>\n<p>During this step, the ESSID Solutions team created container instances with Logic Apps. We then had to give each container access to Azure resources and to sensitive data (passwords, connection strings, etc.) 
that was stored in Azure KeyVault.<\/p>\n<p><img alt=\"Scraper part of the project\" decoding=\"async\" height=\"301\" loading=\"lazy\" sizes=\"(max-width: 871px) 100vw, 871px\" src=\"https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-write.png\" srcset=\"https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-write.png 871w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-write-300x104.png 300w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-write-768x265.png 768w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-write-742x256.png 742w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-write-640x221.png 640w\" width=\"871\"\/><\/p>\n<p>To store the scrapers' results, our team decided to create a storage account, which acts like a folder in the cloud for saving the scraped data. 
After that, we could start our scrapers manually, but we still needed orchestration, automation and post-processing.<\/li>\n<li>\n<h3>Data Factory and orchestration<\/h3>\n<p>Our engineers ran all the scrapers on time-based triggers within a single Azure Data Factory pipeline.<\/p>\n<p>The main pipeline was designed to start all the containers with requests via the Azure API, then run <a href=\"http:\/\/localhost\/essidsolutions\/service\/databricks-managed-services\">DataBricks<\/a> notebooks to process the collected data.<\/p>\n<p><img alt=\"Orchestration part of the project\" decoding=\"async\" height=\"305\" loading=\"lazy\" sizes=\"(max-width: 757px) 100vw, 757px\" src=\"https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-run.png\" srcset=\"https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-run.png 757w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-run-300x121.png 300w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-run-742x299.png 742w, https:\/\/essidsolutions.com\/wp-content\/uploads\/2023\/02\/case-data-scraping-solution-run-640x258.png 640w\" width=\"757\"\/><\/li>\n<li>\n<h3>DataBricks<\/h3>\n<p>At this stage, we had scraped all the data from the websites (because incremental loading of data from websites is difficult or impossible), and we processed it and saved all of it to the database. 
Before loading new data into the database, we deleted the existing data.<\/p>\n<p>As a result, the client has a robust data scraping solution that pulls data from multiple websites and business listings and gathers information on African American-founded businesses that is useful to subscribers of the client's platform.<\/li>\n<\/ol>\n<h2 style=\"text-align: center;\">Result: optimized data scraping to reduce processing time<\/h2>\n<p>Our team of data scientists and engineers drew on multiple sources to meet the client's data scraping needs.<\/p>\n<p>Our solution empowered the client in the following ways:<\/p>\n<ul>\n<li><a href=\"https:\/\/essidsolutions.com\/data-extraction\">Large-scale data extraction<\/a><\/li>\n<li>Structured data delivery<\/li>\n<li>Low maintenance and speed<\/li>\n<li>Easy to implement<\/li>\n<li>Automation.<\/li>\n<\/ul>","protected":false},"excerpt":{"rendered":"<p>Key Details Accessing multiple data sources with data scraping. Solution Best practices for robust and resilient web scraping Technologies and tools Microsoft Azure Cloud Services for infrastructure hosting, tuning and administration. Python with the required libraries and frameworks (Azure-sdk, Scrapy, Selenium, etc.) for website scraping ... 
Read more <a title=\"Efficient Website Data Scraping for Improved Data Management\" class=\"read-more\" href=\"https:\/\/esisoc.com\/fr\/resource\/lextraction-efficace-des-donnees-dun-site-web-pour-une-meilleure-gestion-des-donnees\/\" aria-label=\"Read more about Efficient Website Data Scraping for Improved Data Management\">Read more<\/a><\/p>","protected":false},"featured_media":2860,"template":"","industry":[77,70],"expertise":[74,65,78,72,58],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v21.9 (Yoast SEO v21.9.1) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Efficient Website Data Scraping for Improved Data Management - ESISOC | ESSID Solutions<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/esisoc.com\/fr\/resource\/lextraction-efficace-des-donnees-dun-site-web-pour-une-meilleure-gestion-des-donnees\/\" \/>\n<meta property=\"og:locale\" content=\"fr_FR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Efficient Website Data Scraping for Improved Data Management\" \/>\n<meta property=\"og:description\" content=\"Key Details Accessing multiple data sources with data scraping. Challenge Fast and accurate data scraping from multiple sources Solution Best practices for robust &amp; resilient web scraping Technologies and tools Microsoft Azure Cloud Services for infrastructure hosting, tuning and administration. Python language with required libraries and frameworks (Azure-sdk, Scrapy, Selenium, etc.) for web sites scraping ... 
Lire plus\" \/>\n<meta property=\"og:url\" content=\"https:\/\/esisoc.com\/fr\/resource\/lextraction-efficace-des-donnees-dun-site-web-pour-une-meilleure-gestion-des-donnees\/\" \/>\n<meta property=\"og:site_name\" content=\"ESISOC | ESSID Solutions\" \/>\n<meta property=\"og:image\" content=\"https:\/\/esisoc.com\/wp-content\/uploads\/2024\/02\/83ede7fb50b04acc8e2536d6b92b7761.webp\" \/>\n\t<meta property=\"og:image:width\" content=\"839\" \/>\n\t<meta property=\"og:image:height\" content=\"514\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/webp\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Dur\u00e9e de lecture estim\u00e9e\" \/>\n\t<meta name=\"twitter:data1\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/\",\"url\":\"https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/\",\"name\":\"Efficient Website Data Scraping for Improved Data Management - ESISOC | ESSID 
Solutions\",\"isPartOf\":{\"@id\":\"https:\/\/esisoc.com\/#website\"},\"datePublished\":\"2024-02-18T15:29:24+00:00\",\"dateModified\":\"2024-02-18T15:29:24+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/#breadcrumb\"},\"inLanguage\":\"fr-FR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/esisoc.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Efficient Website Data Scraping for Improved Data Management\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/esisoc.com\/#website\",\"url\":\"https:\/\/esisoc.com\/\",\"name\":\"ESISOC | ESSID Solutions\",\"description\":\"Data Science Consulting and AI | Online Books, Videos, Courses and more\",\"publisher\":{\"@id\":\"https:\/\/esisoc.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/esisoc.com\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"fr-FR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/esisoc.com\/#organization\",\"name\":\"ESISOC | ESSID Solutions\",\"url\":\"https:\/\/esisoc.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/esisoc.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/esisoc.com\/wp-content\/uploads\/2024\/02\/logo-esisoc.png\",\"contentUrl\":\"https:\/\/esisoc.com\/wp-content\/uploads\/2024\/02\/logo-esisoc.png\",\"width\":350,\"height\":63,\"caption\":\"ESISOC | ESSID Solutions\"},\"image\":{\"@id\":\"https:\/\/esisoc.com\/#\/schema\/logo\/image\/\"}}]}<\/script>\n<!-- \/ 
Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Efficient Website Data Scraping for Improved Data Management - ESISOC | ESSID Solutions","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/esisoc.com\/fr\/resource\/lextraction-efficace-des-donnees-dun-site-web-pour-une-meilleure-gestion-des-donnees\/","og_locale":"fr_FR","og_type":"article","og_title":"Efficient Website Data Scraping for Improved Data Management","og_description":"Key Details Accessing multiple data sources with data scraping. Challenge Fast and accurate data scraping from multiple sources Solution Best practices for robust &amp; resilient web scraping Technologies and tools Microsoft Azure Cloud Services for infrastructure hosting, tuning and administration. Python language with required libraries and frameworks (Azure-sdk, Scrapy, Selenium, etc.) for web sites scraping ... Lire plus","og_url":"https:\/\/esisoc.com\/fr\/resource\/lextraction-efficace-des-donnees-dun-site-web-pour-une-meilleure-gestion-des-donnees\/","og_site_name":"ESISOC | ESSID Solutions","og_image":[{"width":839,"height":514,"url":"https:\/\/esisoc.com\/wp-content\/uploads\/2024\/02\/83ede7fb50b04acc8e2536d6b92b7761.webp","type":"image\/webp"}],"twitter_card":"summary_large_image","twitter_misc":{"Dur\u00e9e de lecture estim\u00e9e":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/","url":"https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/","name":"Efficient Website Data Scraping for Improved Data Management - ESISOC | ESSID 
Solutions","isPartOf":{"@id":"https:\/\/esisoc.com\/#website"},"datePublished":"2024-02-18T15:29:24+00:00","dateModified":"2024-02-18T15:29:24+00:00","breadcrumb":{"@id":"https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/#breadcrumb"},"inLanguage":"fr-FR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/esisoc.com\/resource\/efficient-website-data-scraping-for-improved-data-management\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/esisoc.com\/"},{"@type":"ListItem","position":2,"name":"Efficient Website Data Scraping for Improved Data Management"}]},{"@type":"WebSite","@id":"https:\/\/esisoc.com\/#website","url":"https:\/\/esisoc.com\/","name":"ESISOC | ESSID Solutions","description":"Data Science Consulting and AI | Online Books, Videos, Courses and more","publisher":{"@id":"https:\/\/esisoc.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/esisoc.com\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"fr-FR"},{"@type":"Organization","@id":"https:\/\/esisoc.com\/#organization","name":"ESISOC | ESSID Solutions","url":"https:\/\/esisoc.com\/","logo":{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/esisoc.com\/#\/schema\/logo\/image\/","url":"https:\/\/esisoc.com\/wp-content\/uploads\/2024\/02\/logo-esisoc.png","contentUrl":"https:\/\/esisoc.com\/wp-content\/uploads\/2024\/02\/logo-esisoc.png","width":350,"height":63,"caption":"ESISOC | ESSID 
Solutions"},"image":{"@id":"https:\/\/esisoc.com\/#\/schema\/logo\/image\/"}}]}},"_links":{"self":[{"href":"https:\/\/esisoc.com\/fr\/wp-json\/wp\/v2\/resource\/2859"}],"collection":[{"href":"https:\/\/esisoc.com\/fr\/wp-json\/wp\/v2\/resource"}],"about":[{"href":"https:\/\/esisoc.com\/fr\/wp-json\/wp\/v2\/types\/resource"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/esisoc.com\/fr\/wp-json\/wp\/v2\/media\/2860"}],"wp:attachment":[{"href":"https:\/\/esisoc.com\/fr\/wp-json\/wp\/v2\/media?parent=2859"}],"wp:term":[{"taxonomy":"industry","embeddable":true,"href":"https:\/\/esisoc.com\/fr\/wp-json\/wp\/v2\/industry?post=2859"},{"taxonomy":"expertise","embeddable":true,"href":"https:\/\/esisoc.com\/fr\/wp-json\/wp\/v2\/expertise?post=2859"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}