How to build your own (topic-specific) search engine

- Web-Crawler, Meta Search Engine
From the category: Knowledge Base

Q: "I need to explore multiple search engines for information and URLs on relevant medical equipment such as chemistry analyzers and surgical tables."

A: "This topic is more complex than it looks, and Manuel's recommendation needs some clarification."




So please let me share my thoughts on this topic.
The painfully simple fact is: I cannot point you to a "compact solution" or ready-made package, although I would like to do a bit of advertising here (sorry). For further development on this I simply need money ;-)

For now, here is some deeper guidance. If you follow it, I believe your problem is easy to solve.

First, what you are actually looking for is a "Meta Search Engine" or a "Subject-Specific Web Crawler" (for your pharmacy topic, in this case).
If you just use the Google API as suggested by Manuel, you will run into restrictions and privacy issues. First, Google will track all activity of your client (this will probably also happen if you use the approach I suggest below and stay legal, but never mind for now). Second, Google will personalize results for your client, which will affect your SERPs (search engine result pages).

Finally, a more complex web crawler lets you do three things. First, you can include sources from more than one provider in your SERPs. Second, you can fake user agents, user agent sessions, IPs, local settings and proxies. Third, you can process the fetched search results further: spider the web pages listed on the search engines' result pages, parse and analyze them against your own meta search engine's topic-specific criteria, follow links, and finally build your own pharma database from what you crawled on the web.
Also remember that there are lots of sources, such as topic-specific RSS feeds, that you will want to spider, moderate and add to your search engine manually!
Once you have combined a number of search engines, found websites and ways to access them (by meta search engine, by fake client or just through the common APIs), you will want to process all that data, analyze text and meta tags, and provide prepared views and search results to your end user.
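To make the "more than one provider" idea a bit more concrete, here is a minimal sketch in Python (not my production code, just an illustration). It assumes each provider already gave you a ranked list of (url, title) pairs; the provider names, the URL normalization and the 1/rank scoring are my own assumptions for this example, and you may well want a different weighting.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Reduce a URL to a comparable form: lowercase scheme and host, no fragment, no trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def merge_results(results_by_provider):
    """Merge ranked result lists from several providers into one deduplicated list.

    results_by_provider maps a provider name to a list of (url, title) tuples,
    best hit first. Each URL collects 1/rank per provider, so hits that several
    providers agree on move towards the top.
    """
    merged = {}
    for provider, results in results_by_provider.items():
        for rank, (url, title) in enumerate(results, start=1):
            key = normalize_url(url)
            entry = merged.setdefault(
                key, {"url": key, "title": title, "score": 0.0, "providers": []}
            )
            entry["score"] += 1.0 / rank
            entry["providers"].append(provider)
    return sorted(merged.values(), key=lambda entry: entry["score"], reverse=True)
```

You would call merge_results({"engine_a": [...], "engine_b": [...]}) with whatever your own fetchers return; the merged entries keep track of which providers reported each URL, which is handy for later ranking.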



Technically: in another thread on a different topic, Manuel and I discussed the fact that Google, for instance, changed its result pages from synchronous HTML to JavaScript and its links from query parameters to #hashes. One could conclude that processing the SERPs with PHP alone no longer works and that something like Node.js or a JavaScript VM is needed.
On the one hand, several common solutions for that are available now in 2016. On the other hand, I have to report that our conjecture about the SERPs being delivered asynchronously IS NOT ACTUALLY TRUE: I HAVE a meta search engine in use and it CAN fetch the HTML by requesting ?q=queryparameters, without any asynchronous client-side /#!HASHBANGS!
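As a rough illustration of that point, the following Python sketch builds a plain ?q= request URL and pulls the links out of static HTML without any JavaScript engine. The base URL and the HTML you feed in are placeholders of mine; the markup of a real result page differs per engine and changes over time, so treat the link filter as an assumption to adapt.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


def build_search_url(base_url, query):
    """Build a plain GET search URL of the form <base_url>?q=<query>."""
    return base_url + "?" + urlencode({"q": query})


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) link targets from the <a href="..."> tags of a page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip missing hrefs and pure in-page anchors such as "#top".
        if not href or href.startswith("#"):
            return
        absolute = urljoin(self.page_url, href)
        # Keep only web links (drops mailto:, javascript: and similar) and avoid duplicates.
        if absolute.startswith(("http://", "https://")) and absolute not in self.links:
            self.links.append(absolute)


def extract_links(html, page_url):
    """Return the list of absolute links found in an HTML string."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    return parser.links
```

The extracted list is what you would queue up for the second round of requests, i.e. fetching each found page for your own analysis.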




REQUESTS (e.g. when running a search):

  • A "normal" request to Google search, like /?q=searchterm
  • A request to each of the links found in the result page of the previous Google request (its links are processed later...)
  • Cached requests to the official Bing API for text results, images, videos,...
  • Periodic automated requests to a few selected breaking-news portals
  • Periodic automated requests to a few selected RSS feeds
  • Query the web crawler's result DB
  • Query a lot of internal DBs and tables, e.g. domain-specific and user-generated data stocks
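For the "cached requests" item in the list above, a time-limited cache in front of the API call is one simple way to do it. This is a minimal in-memory sketch in Python; the class name, the fetch callable and the 60 second lifetime in the usage line are assumptions of mine, and a real setup would more likely persist the cache in a database or on disk.

```python
import time


class TTLCache:
    """Cache the result of an expensive fetch for a fixed number of seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self.fetch = fetch              # callable(query) -> result, e.g. an API call
        self.ttl_seconds = ttl_seconds  # how long a cached result stays valid
        self.clock = clock              # injectable so the expiry logic can be tested
        self._store = {}                # query -> (expires_at, result)

    def get(self, query):
        """Return the cached result for query, calling fetch only if it is missing or expired."""
        now = self.clock()
        cached = self._store.get(query)
        if cached is not None and cached[0] > now:
            return cached[1]
        result = self.fetch(query)
        self._store[query] = (now + self.ttl_seconds, result)
        return result
```

Usage would look like cache = TTLCache(my_api_search, ttl_seconds=60) followed by cache.get("surgical tables"), where my_api_search stands for whatever function actually talks to the API.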

PREPROCESSING of the results:

  • Split the results into topics/modules
  • Perform an SQL FULLTEXT search on the results
  • Parse and analyze the text and HTML/meta tags of the found pages
  • Configurable ranking of the results based on the criteria
  • Calculate the "SEO performance" of a few selected sites
  • Compute the most important buzzwords in the current breaking-news headlines
  • Extract links for later crawling
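The buzzword step from the list above can be approximated very simply. The Python sketch below counts in how many headlines each word appears and returns the most frequent ones; the tiny English stopword list and the "longer than two characters" rule are assumptions for the example, not what my engine actually uses.

```python
import re
from collections import Counter

# Deliberately tiny stopword list for illustration; extend it for real use.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to", "with", "is", "are"}


def top_buzzwords(headlines, limit=5):
    """Return the words that appear in the most headlines as (word, count) pairs."""
    counts = Counter()
    for headline in headlines:
        # A set counts each word once per headline (document frequency).
        words = set(re.findall(r"[a-z0-9]+", headline.lower()))
        counts.update(w for w in words if w not in STOPWORDS and len(w) > 2)
    # Sort by count (descending), then alphabetically, so ties come out in a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

Counting per headline rather than per occurrence keeps one repetitive headline from dominating the result.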

MISC:

  • Search form autocomplete
  • OpenSearchDescription.xml (to register the search engine in the browser)
  • No "faking clients" or "special tricks": when requesting sites, the user agent and the IP identify my crawler as my metacrawler, so everything is legal and fair!
  • Does not spy on user activity; complies with the Webfan Privacy Policy and the BDSG
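For the OpenSearchDescription.xml item in the list above, this is a minimal sketch of how such a file could be generated in Python. The element names and the namespace follow the OpenSearch 1.1 format as I understand it; the short name, the description and the search URL in the usage line are placeholders of mine.

```python
import xml.etree.ElementTree as ET

OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"


def build_opensearch_description(short_name, description, search_url_template):
    """Return an OpenSearch description document as an XML string.

    search_url_template must contain the {searchTerms} placeholder, which the
    browser replaces with the user's query.
    """
    # Serialize the OpenSearch namespace as the default namespace (no prefix).
    ET.register_namespace("", OPENSEARCH_NS)
    root = ET.Element("{%s}OpenSearchDescription" % OPENSEARCH_NS)
    ET.SubElement(root, "{%s}ShortName" % OPENSEARCH_NS).text = short_name
    ET.SubElement(root, "{%s}Description" % OPENSEARCH_NS).text = description
    ET.SubElement(root, "{%s}InputEncoding" % OPENSEARCH_NS).text = "UTF-8"
    ET.SubElement(
        root,
        "{%s}Url" % OPENSEARCH_NS,
        {"type": "text/html", "template": search_url_template},
    )
    return ET.tostring(root, encoding="unicode")
```

For example, build_opensearch_description("Med Search", "Topic-specific search", "https://search.example/?q={searchTerms}") gives you the file content; browsers typically discover it through a link element with rel="search" and type="application/opensearchdescription+xml" in your page head.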

Although your request to build a search for "medical, chemistry, surgical products" sounds a bit scary to me, never mind: I am currently qualified and available for hire.


To get started building your own web crawler, I recommend the following package:
Using the Http Client Class by Manuel Lemos, you will be able to develop a search engine crawler client or any other kind of bot or proxy in PHP with ease.
You will find many other helpful classes for querying websites, SEO and API usage (e.g. Google APIs) on phpclasses.org.
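If you want to see the very first step of such a crawler client before picking a library, here is a language-neutral idea sketched in Python with the standard library only. To be clear, this is not the Http Client Class and not its API; the crawler name and info URL in the user agent are made-up placeholders that you would replace with your own, so that your bot identifies itself honestly.

```python
from urllib.request import Request, urlopen

# Hypothetical crawler identity; use your own name and info URL.
CRAWLER_USER_AGENT = "ExampleMetaCrawler/0.1 (+https://crawler.example/about)"


def build_request(url):
    """Create a GET request that openly identifies the crawler via its User-Agent."""
    return Request(url, headers={"User-Agent": CRAWLER_USER_AGENT, "Accept": "text/html"})


def fetch_html(url, timeout=10):
    """Download a page and decode it, falling back to UTF-8 if no charset is declared."""
    with urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html("https://example.org/") returns the page source as text, which you can then hand to your parser.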






Created by WEBFAN (Monday 1st of August 2016 03:29:24 PM)
in the category Knowledge Base as a static page
Published/approved: Monday 1st of August 2016 04:56:17 PM by WEBFAN
Last modified: Monday 1st of August 2016 04:56:17 PM by WEBFAN