Publication

Geographical partition for distributed web crawling

dc.contributor.authorExposto, José
dc.contributor.authorMacedo, Joaquim
dc.contributor.authorPina, António
dc.contributor.authorAlves, Albano
dc.contributor.authorRufino, José
dc.date.accessioned2008-02-25T14:46:03Z
dc.date.available2008-02-25T14:46:03Z
dc.date.issued2005
dc.description.abstractThis paper evaluates scalable distributed crawling based on a geographical partition of the Web. The approach relies on multiple distributed crawlers, each responsible for the pages belonging to one or more previously identified geographical zones. The work considers a distributed crawler in which pages are assigned according to the geographical scope of their content. For the initial assignment of a page to a partition, we use a simple heuristic that places the page in the same scope as the geographical location of its hosting web server. During download, if analysis of the page's content suggests a different geographical scope, the page is forwarded to the appropriately located server. A sample of Portuguese Web pages, extracted during 2005, was used to evaluate (a) page download communication times and (b) the overhead of exchanging pages among servers. The evaluation results allow our approach to be compared with conventional hash partitioning strategies.en
dc.identifier.citationExposto, José; Macedo, Joaquim; Herzog, Pina, António; Alves, Albano; Rufino, José (2005). Geographical partition for distributed web crawling. In International Conference on Information and Knowledge Management. Bremen, Germany. ISBN 1-59593-140-6en
dc.identifier.slugInternational Conference on Information and Knowledge Managementen
dc.identifier.urihttp://hdl.handle.net/10198/526
dc.language.isoengen
dc.language.rfc3066engen
dc.peerreviewedyesen
dc.publisherACMen
dc.relation.publisherversionhttp://www.tzi.de/sites/CIKM2005/programs.htmlen
dc.subjectWeb miningen
dc.subjectParallel crawlingen
dc.subjectWeb partitioningen
dc.titleGeographical partition for distributed web crawlingen
dc.typeconference paper
dspace.entity.typePublication
oaire.awardURIinfo:eu-repo/grantAgreement/FCT/Orçamento de Funcionamento%2FPOSC/POSI%2FCHS%2F41739%2F2001/PT
oaire.fundingStreamOrçamento de Funcionamento/POSC
person.familyNameExposto
person.familyNameAlves
person.familyNameRufino
person.givenNameJosé
person.givenNameAlbano
person.givenNameJosé
person.identifier.ciencia-idDA10-808F-99EA
person.identifier.ciencia-id281A-DD4A-2605
person.identifier.ciencia-idC414-F47F-6323
person.identifier.orcid0000-0003-3857-6083
person.identifier.orcid0000-0001-9796-6810
person.identifier.orcid0000-0002-1344-8264
person.identifier.scopus-author-id56619498700
person.identifier.scopus-author-id55947199100
project.funder.identifierhttp://doi.org/10.13039/501100001871
project.funder.nameFundação para a Ciência e a Tecnologia
rcaap.rightsopenAccessen
rcaap.typeconferenceObjecten
relation.isAuthorOfPublication66fd8128-90b1-4754-936e-2d9e9e0829ec
relation.isAuthorOfPublication80d7f985-d700-4911-8974-b2678816db35
relation.isAuthorOfPublication1e24d2ce-a354-442a-bef8-eebadd94b385
relation.isAuthorOfPublication.latestForDiscovery66fd8128-90b1-4754-936e-2d9e9e0829ec
relation.isProjectOfPublication91cd79d6-e593-4d41-8de8-6bc3bdc5e492
relation.isProjectOfPublication.latestForDiscovery91cd79d6-e593-4d41-8de8-6bc3bdc5e492
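
The assignment scheme outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the zone names, the `HOST_ZONE` lookup table, the `content_zone` field, and the functions `initial_zone`, `route`, and `hash_partition` are all hypothetical, introduced here only to contrast geographical routing with a conventional hash partition.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical geographical zones, one crawler per zone (assumption).
ZONES = ["north", "centre", "south"]

# Hypothetical table mapping a host to the zone of its web server (assumption).
HOST_ZONE = {"a.example.pt": "north", "b.example.pt": "south"}


@dataclass
class Page:
    url: str
    host: str
    # Scope suggested by content analysis, if any (assumed to be computed elsewhere).
    content_zone: Optional[str] = None


def hash_partition(host: str) -> str:
    """Baseline: assign a host to a zone by hashing its name."""
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return ZONES[int(digest, 16) % len(ZONES)]


def initial_zone(page: Page, default: str = "centre") -> str:
    """Initial heuristic: use the zone of the hosting web server."""
    return HOST_ZONE.get(page.host, default)


def route(page: Page) -> Tuple[str, bool]:
    """Return (zone, forwarded); forward when content suggests another scope."""
    start = initial_zone(page)
    if page.content_zone is not None and page.content_zone != start:
        return page.content_zone, True
    return start, False
```

In this sketch, each `True` in the second element of the result from `route` corresponds to one page exchanged between servers, which is the kind of overhead the abstract says was measured.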

Files

Original bundle
Name: GIR2005-exp.pdf
Size: 180.21 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.81 KB
Format: Item-specific license agreed upon to submission