How to build your own (topic specific) search engine

- Web-Crawler, Meta Search Engine

Q: "I need to explore multiple search engines for information and URLs on relevant medical equipment such as chemistry analyzers and surgical tables."

A: "This topic is more complex than it looks, and Manuel's recommendation needs some clarification."

So, please let me share my thoughts about this topic.
The horribly simple fact is: I cannot suggest a "compact solution"/package, but I would like to do some advertisement (sorry). For further development on this, I simply need money ;-)

Still, here is some deeper guidance; if you follow it, I believe your problem is easy to resolve.

First, what you are actually looking for is a "Meta Search Engine" or a "Subject-Specific Web Crawler" (for your damned pharmacy, however).
If you just use the Google API as suggested by Manuel, you will run into restrictions and privacy issues. First, Google will track all activity of your client (this will probably also happen if you use the examples I suggest below and stay legal, but never mind for now). Second, Google will personalize your client, which will affect your SERP.

Finally, a more complex web crawler will enable you, first, to include sources from more than one provider in your SERPs; second, to fake user agents, user agent sessions, IPs, local settings and proxies; and third, to process the fetched search results further: spider the web pages listed on the search engines' result pages, parse and analyze them against your own meta search engine's topic-specific criteria, follow links, and finally build your own pharma database based on your web.
Also remember that there are lots of sources, such as topic-specific RSS feeds, that you will want to spider, moderate and add to your search engine manually!
Once you have combined a number of search engines, found websites and ways to access them (by meta search engine, by fake client, or just by using the common APIs), you will want to process all that data, analyze text and meta tags, and provide prepared views and search results to your end user.
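Combining several providers mostly comes down to merging their result lists and removing duplicates. The following is only a minimal sketch of that idea in Python (chosen here for brevity; the same logic ports to PHP). The function names, the provider labels and the "more providers first" ordering are assumptions for illustration, not part of any existing package.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and a trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def merge_results(result_lists: dict[str, list[str]]) -> list[tuple[str, list[str]]]:
    """Merge per-provider URL lists into one de-duplicated list.

    Returns (url, providers) pairs; URLs reported by more providers come first.
    """
    seen: dict[str, list[str]] = {}
    for provider, urls in result_lists.items():
        for url in urls:
            providers = seen.setdefault(normalize_url(url), [])
            if provider not in providers:
                providers.append(provider)
    # sorted() is stable, so URLs with equal counts keep their first-seen order.
    return sorted(seen.items(), key=lambda item: -len(item[1]))


if __name__ == "__main__":
    # Hypothetical result lists from two hypothetical providers.
    merged = merge_results({
        "engine_a": ["https://Example.com/analyzers/", "https://example.org/tables"],
        "engine_b": ["https://example.com/analyzers#top"],
    })
    for url, providers in merged:
        print(url, providers)
```

Counting how many providers returned a URL is just one possible ranking signal; a real meta search engine would combine it with its own topic-specific criteria.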

Technically: In another thread on a different topic, I discussed with Manuel the fact that Google, for instance, changed its result pages from synchronous HTML to JavaScript and its links from query parameters to #hashes, so one might think that processing the SERPs with PHP alone no longer works and that something like Node.js or a JavaScript VM is needed.
On the one hand, several common solutions for that are available now in 2016; on the other hand, I have to report that our conjecture about the SERPs being delivered asynchronously IS NOT ACTUALLY TRUE: I HAVE a meta search engine in use, and it CAN fetch HTML by requesting ?q=query parameters, without any asynchronous client-side code or #! HASHBANGS needed!
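To make the "?q=query" style of request concrete, here is a small Python sketch that only builds such a request and does not send it. The host name `search.example`, the `/search` path and the user agent string are placeholders I made up; substitute the endpoint and the bot identity you actually use.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(base_url: str, term: str, user_agent: str) -> Request:
    """Build a plain GET request of the form base_url?q=term."""
    query = urlencode({"q": term})  # spaces become "+", special characters are escaped
    return Request(f"{base_url}?{query}", headers={"User-Agent": user_agent})


if __name__ == "__main__":
    req = build_search_request(
        "https://search.example/search",                      # hypothetical endpoint
        "chemistry analyzer",
        "ExampleMetaCrawler/0.1 (+https://example.com/bot)",  # hypothetical bot identity
    )
    print(req.full_url)
    # Passing req to urllib.request.urlopen() would actually send the request.
```

Setting an honest `User-Agent` here is what lets the crawler identify itself instead of posing as a browser.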

REQUESTS (e.g. when running a search):

  • A "normal" request to Google search, like /?q=searchterm
  • A request to each of the links found in the result page of the previous request to Google (processing its links later...)
  • Cached requests to the official Bing API for text results, images, videos, ...
  • Periodic automated requests to a few selected breaking news portals
  • Periodic automated requests to a few selected RSS feeds
  • Querying the web crawler's result DB
  • Querying a lot of internal DBs and tables, e.g. domain-specific and user-generated data stocks
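As an illustration of the RSS part of this list, the sketch below parses an RSS 2.0 document into title/link pairs with Python's standard library. The sample feed is invented and embedded as a string so the example runs offline; in a real crawler the XML would come from the periodic requests described above.

```python
import xml.etree.ElementTree as ET

# Invented sample feed; a real one would be downloaded periodically.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example medical equipment news</title>
    <item>
      <title>New chemistry analyzer announced</title>
      <link>https://example.com/news/analyzer</link>
    </item>
    <item>
      <title>Surgical table recall</title>
      <link>https://example.com/news/table</link>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text: str) -> list[dict[str, str]]:
    """Return the title and link of every <item> that has a link."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:
            items.append({"title": title, "link": link})
    return items


if __name__ == "__main__":
    for entry in parse_rss_items(SAMPLE_FEED):
        print(entry["title"], "->", entry["link"])
```

This handles plain RSS 2.0 only; Atom feeds use namespaced `<entry>` elements and would need their own branch.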

PREPROCESSING of the results:

  • Split the results into topics/modules
  • Perform an SQL FULLTEXT search on the results
  • Parse and analyze the text and HTML/meta tags of the found pages
  • Rank the results configurably, based on the criteria
  • Calculate the "SEO performance" of a few selected sites
  • Compute the most important buzzwords of the current breaking-news headlines
  • Extract links for later crawling
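Two of the steps above, extracting links and computing buzzwords, are easy to sketch with Python's standard library. This is a simplified illustration under my own assumptions: the class and function names, the tiny stopword list and the "most frequent word" definition of a buzzword are not taken from the engine described here.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute URLs of all <a href="..."> tags in a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


# Deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "for", "and", "in", "to", "new"}


def top_buzzwords(headlines: list[str], limit: int = 3) -> list[tuple[str, int]]:
    """Return the most frequent non-stopword words across all headlines."""
    words = []
    for headline in headlines:
        words.extend(
            w for w in re.findall(r"[a-z]+", headline.lower()) if w not in STOPWORDS
        )
    return Counter(words).most_common(limit)


if __name__ == "__main__":
    parser = LinkExtractor("https://example.com/products/")
    parser.feed('<p><a href="analyzers.html">Analyzers</a> <a href="/tables">Tables</a></p>')
    print(parser.links)
    print(top_buzzwords(["Analyzer recall announced", "New analyzer for the lab"], limit=2))
```

The extracted links would go into a queue for later crawling; a production crawler would additionally filter them by domain and respect robots.txt.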


  • Search form autocomplete
  • OpenSearchDescription.xml (to register search engine in browser)
  • No "faking clients" or "special tricks": when requesting sites, the user agent and the IP identify my crawler as my metacrawler; everything is legal and fair!
  • Does not spy on user activity; complies with the Webfan Privacy Policy and the BDSG
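For the OpenSearchDescription.xml item, a minimal description document can be generated as sketched below. The element names and the namespace follow the OpenSearch 1.1 description format; the short name, description text and URL template are placeholder values, and the helper function itself is hypothetical.

```python
import xml.etree.ElementTree as ET

OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"


def build_opensearch_description(short_name: str, description: str, url_template: str) -> str:
    """Return a minimal OpenSearch description document as an XML string."""
    ET.register_namespace("", OPENSEARCH_NS)  # serialize with a default namespace
    root = ET.Element(f"{{{OPENSEARCH_NS}}}OpenSearchDescription")
    ET.SubElement(root, f"{{{OPENSEARCH_NS}}}ShortName").text = short_name
    ET.SubElement(root, f"{{{OPENSEARCH_NS}}}Description").text = description
    # {searchTerms} is the placeholder the browser replaces with the user's query.
    ET.SubElement(
        root,
        f"{{{OPENSEARCH_NS}}}Url",
        {"type": "text/html", "template": url_template},
    )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_opensearch_description(
        "Example Search",                              # placeholder name
        "Topic-specific meta search",                  # placeholder description
        "https://example.com/search?q={searchTerms}",  # placeholder endpoint
    ))
```

Browsers typically discover the file through a `<link rel="search" type="application/opensearchdescription+xml" href="..." title="...">` element in the page head.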

Although your request to build a search for "medical, chemistry, surgical products" sounds a bit scary to me, never mind: I am currently qualified and available for hire.

To get started building your own web crawler, I recommend the following package:
Using the Http Client Class by Manuel Lemos, you will be able to develop a search engine crawler client or any other kind of bot or proxy in PHP with ease.
You will also find many other helpful classes related to querying websites, SEO and API usage (e.g. Google APIs).

Created by WEBFAN (Monday 1st of August 2016 03:29:24 PM)
in the category Knowledge Base as a static page
Published/approved: Monday 1st of August 2016 04:56:17 PM by WEBFAN
Last modified: Monday 1st of August 2016 04:56:17 PM by WEBFAN