1 – First steps – different tools for different jobs.
Although Google is by far the most popular search engine it is not the best for every search and may be a poor choice for ‘background’ work on new topics. Imagine you’ve been commissioned to write an article about the illness caused by the norovirus. This is sometimes called the ‘winter vomiting bug’ and causes outbreaks of sickness in hospitals and cruise ships.
Type the term ‘norovirus’ into Google and you find official government pages and general fact sheets dominate the first 20 results. These 20 results also include a handful of news stories and a couple of references to sites for clinicians. These tools may help you focus faster:
• alltheweb’s new ‘livesearch’ engine provides search results alongside alternative search queries – as you type. This means you don’t have the laborious task of adjusting search terms. The range of alternative terms includes, for example, ‘norovirus outbreak’ and ‘symptoms’.
• You can also use Kartoo to choose from a range of suggested linked ‘topics’. Kartoo displays linked search terms within visual ‘maps’ that plot the results and indicate how they relate to each other. As you highlight a result a small preview of the page appears in the left-hand column.
• Clusty ‘clusters’ results according to sub headings. Enter ‘norovirus’ into Clusty and the suggested clusters include ‘litigation’, ‘outbreak’ and ‘cruise ships’. Click on ‘cruise ships’ and you’re given a range of further sub-headings that include ‘passengers and crew’ and ‘gastroenteritis outbreaks’.
• Use Google Trends to get a feel for how a story has developed. Enter ‘norovirus’ here and it displays a graph showing search trends for that term. Major news stories related to the search term are plotted on the graph. Interest in ‘norovirus’ peaked when the virus hit the QE2 cruise in January this year.
Kartoo and Clusty are two of many ‘meta-search’ tools which aggregate results from a range of search engines and display the results in different ways. Ixquick is another option. But while meta-search engines are a great way to narrow your search, they aren’t precise enough for detailed trawls. This is because meta-search tools pull in just a few dozen results from each of the major search engines. If the gem of information you are looking for is not among them, it will be impossible to find no matter how many times you refine the search term.
2 – Starting to focus
To narrow your focus further on specific angles you need to turn to specific commands. This section describes those you can use with Google, although many of these, or ones like them, work with other search engines.
Google will only return hits that include all your search terms, so using the Boolean search term ‘AND’ is not necessary. Google normally ignores small words, however, so occasionally you may need to force it to include some words by using the ‘plus’ symbol. For example, a search for ‘charles I’ returns more results if you force it to include ‘I’ in the search.
Other commands you can use:
• Force Google to exclude words. By using ‘norovirus -cruise’ we can search for pages that don’t include information about outbreaks on cruise ships.
• Use the command ‘OR’ to search for pages that contain either of two terms. For example, the search ‘norovirus qe2 OR qeII’ allows for the fact that the cruise ship is described in two ways. Remember – Google is not case sensitive.
• You may need to search for whole phrases. Do this using double quote marks. The search “norovirus litigation” will find that exact phrase and not just pages that contain both of those words.
Finally, always remember that you can use the ‘search within results’ tool at the bottom of Google’s results page. The search ‘norovirus qe2 OR qeII’ returns 805 hits. A search for ‘litigation’ within those results returns 85 hits to explore.
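The basic commands above can be combined mechanically. The following is a minimal Python sketch, not a Google tool: the function name `build_query` and its parameters are invented here for illustration, and it simply assembles the search string you would otherwise type by hand.

```python
def build_query(terms, any_of=(), phrases=(), exclude=()):
    """Assemble a search string from the basic commands described above.

    All names here are hypothetical; this only builds text to paste
    into a search box.
    """
    parts = list(terms)
    # Alternatives are joined with an upper-case OR, e.g. qe2 OR qeII
    if any_of:
        parts.append(" OR ".join(any_of))
    # Exact phrases are wrapped in double quote marks
    parts.extend(f'"{phrase}"' for phrase in phrases)
    # Excluded words get a leading minus sign
    parts.extend(f"-{word}" for word in exclude)
    return " ".join(parts)


print(build_query(["norovirus"], exclude=["cruise"]))
print(build_query(["norovirus"], any_of=["qe2", "qeII"]))
print(build_query([], phrases=["norovirus litigation"]))
```

The three calls reproduce the example searches from this section: the exclusion, the ‘OR’ search and the exact phrase.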
A very neat solution to focusing your search in a niche area is to create a ‘searchroll’. At Rollyo you can create a ‘roll’ of sites and then conduct searches only within those. Your ‘searchroll’ can even be added to your Firefox browser search bar.
3 – Google mining
Sometimes, however, you need to be more precise. In these cases you can use a range of Google tools that can help you identify specific pages and documents containing precise terms. The first place to look is Google’s ‘advanced search’ option. Here you can specify terms to include or exclude, as well as exact phrases. You can also command Google to return only results in specific file formats (PDF, Word, Excel etc) or from specific domains. You can also specify where on the page the search term appears (in the page title or the content, for example) and the date range when the page was indexed or reindexed.
While these advanced pages are useful you can also use a range of advanced search ‘operators’ to hone results down. Once you get used to them you’ll wonder how you managed without.
Type these operator commands into Google’s normal search field before the search term you want to use. Here is a range of the most useful Google advanced operators. Some of these aren’t available as an option in Google’s advanced search page. The ‘operators’ are highlighted.
• norovirus site:www.hpa.org.uk – this restricts the search to pages from the Health Protection Agency’s site. You can also use the operator this way: site:.com (you need to use this operator in combination with a search term).
• inurl:norovirus – will only look for urls that contain the word norovirus. This search: ‘inurl:norovirus qe2’ will look for urls containing norovirus and the term ‘qe2’ anywhere on the page.
• norovirus filetype:pdf – will look only in PDF documents for your term. You can also use ‘xls’, ‘ppt’ or ‘doc’ etc.
• link:www.hpa.org.uk – will instantly list all other pages that link to the www.hpa.org.uk page. This operator also works for specific pages. To find out who links to HPA’s page on the norovirus use this: link:www.hpa.org.uk/infections/topics_az/norovirus/menu.htm (this is called reverse link searching).
• intitle: or allintitle: – intitle: searches for a single word in a web page title, while allintitle: searches for several words in the title.
• inanchor:norovirus – will find the term ‘norovirus’ in html links. You can look for names in this way by using this search for example: inanchor:"Marler Clark". Marler Clark is the author of the blog on norovirus – Noroblog.
Go here to find more information about some of Google’s advanced operators.
So how can we use these operators in a practical way? In his book Find It Online Alan Schlein says: ‘The first major step for any research project is to visualise your destination.’ Imagine that crucial nugget is out there. Picture it in your mind. What kind of site will it be in? What does the document look like? Once you’ve done that, combine the free and powerful tools available to hunt it down. Here are a few examples.
Recently I explored the reintroduction of wild species in Scotland and I wanted to know how many sea eagles had fledged last year. I wanted a reliable source (Scottish Natural Heritage) and I guessed the answer would be found in a published report – most likely a PDF. I used this search to obtain the answer: “sea eagles” fledged 2006 inurl:snh filetype:pdf
Similarly, I wanted to know how many red kites had been illegally poisoned. I found the answer using this search: “red kites” poisoning Scotland site:www.rspb.org.uk
I also looked into the links between the former Energy Minister Brian Wilson and the nuclear industry using this search:
“Brian Wilson” “energy minister” site:.com “non-executive director” That search found that he had been appointed as a director of AMEC Nuclear.
Using the example of norovirus, imagine you want to find out about norovirus outbreaks in UK schools. You could use this search:
norovirus schools outbreak inurl:.gov.uk
You can then focus it further by searching within results using the term ‘minutes’, which takes you to the minutes of official committees that have discussed this issue.
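If you build searches like these often, it can help to treat the operators as separate pieces and join them at the end. The sketch below is one possible way to do that in Python. The function names are invented for illustration, and the search address used as the default `base` is an assumption about where the query is sent, not something stated in this article.

```python
from urllib.parse import urlencode


def with_operators(terms, **operators):
    """Append operator:value pairs (e.g. filetype:pdf) to plain search terms.

    Hypothetical helper: it only builds the text of the search.
    """
    parts = [terms] + [f"{name}:{value}" for name, value in operators.items()]
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search"):
    """Turn a search string into a URL; the base address is an assumption."""
    # urlencode escapes spaces, quote marks and colons for use in a URL
    return f"{base}?{urlencode({'q': query})}"


# The sea eagle example from this section
eagles = with_operators('"sea eagles" fledged 2006', inurl="snh", filetype="pdf")
print(eagles)

# The schools example from this section
schools = with_operators("norovirus schools outbreak", inurl=".gov.uk")
print(search_url(schools))
```

Keeping the plain terms and the operators apart makes it easy to swap one operator, for example changing the file type, without retyping the whole search.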
4 – Digging Deeper
Carefully crafted searches can be used to obtain sensitive material and lead to real breakthroughs. Website owners make mistakes. Documents, html pages and whole site directories that should be hidden from view litter the web. Obtaining focused material can, however, be a long and painstaking process, although it is not always. The operators filetype:, inurl: and intitle: are particularly important for constructing search strings that reach deep within sites.
There is no room here to explore the dozens of ways to do this but one of the simplest techniques is to look for directory listings within sites. Directory listings can exist to give users alternative access to files or directories – giving a bypass around normal site navigation. Their existence can be intentional or unintentional and sensitive material can be either intentionally or unintentionally left there. They can be easily accessed because they are often titled ‘index of’. If they exist you may find them using Google’s intitle:index.of operator in combination with terms often found in directory listings – terms such as ‘parent directory’, ‘name’ and ‘size’, or ‘last modified’.
However, you must combine that search with other carefully chosen search terms such as ‘minutes’ (of meetings) or subject terms. Be prepared for a lengthy trawl. The people who leave sensitive information lying around directory listings are termed googledorks – a quick cast through the UK’s public sector websites reveals the breed is thriving. For more on website security and search engine hacking see Google Hacking for Penetration Testers by Johnny Long.
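As a rough illustration of the directory listing technique, the Python sketch below joins the intitle:index.of operator with a phrase commonly found in listings and your own subject terms. The function name is invented, and restricting the search with site: to a domain such as .gov.uk is an assumption made for the example, not a recipe taken from this article.

```python
def directory_listing_query(subject_terms, site=None):
    """Build a search string aimed at open directory listings.

    Hypothetical helper: the fixed parts are the operator and the
    'parent directory' phrase mentioned above.
    """
    parts = ["intitle:index.of", '"parent directory"']
    # Subject terms such as 'minutes' narrow the trawl
    parts.extend(subject_terms)
    # Optionally limit the search to one site or domain (an assumption)
    if site:
        parts.append(f"site:{site}")
    return " ".join(parts)


print(directory_listing_query(["minutes"], site=".gov.uk"))
```

Expect to try several subject terms before anything useful turns up; the sketch only saves retyping the fixed part of the search.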
5 – Anonymity
Most journalists, most of the time, will not need to worry about whether their activities online can be traced. But if you are tackling a sensitive subject you may want to keep a low profile by keeping your online presence anonymous. There are products that can help such as Anonymizer and free services such as Anonymouse. You can also use the Tor tool to protect your identity online. It aims to offer journalists and NGOs a defence against surveillance. It works by distributing your online communication through a myriad of encrypted links.
Finding the best solution for anonymous surfing is beyond the scope of this article. If you think this is necessary then you need to find the right technical solution and be sure it works.
However, if you still need to be convinced that you need to keep a lower profile then look at Browserspy. This is a free service that carries out a series of tests to check what it can find out about you and your internet connection – your IP address for example. If you want to know what your IP address says about you then go to ip-lookup.net.
Many people are also amazed that Google keeps a log of all of your previous searches. To find more on this look at the search history pages in Google. You’ll need to sign up for access to your own search history. From there you can delete some or all of your searches. Even so, it is clearly not made obvious to Google users that this search engine tracks use in this way.
6 – Google alerts
The subject of monitoring future web content is a tutorial on its own. Even so, as this ‘how to’ is focused on search engines we should mention Google’s alert service. You can use this to monitor news, web pages, groups or blogs for keywords or phrases.
As with all email alert tools, you can end up with a cluttered inbox. But if you don’t create too many, and delete alerts once they become redundant, then they can help you keep track of subjects for key assignments.
7 – Google’s cache
Google gives you access to its cache of nearly every search result. You can access this where it says ‘Cached’ next to the url on the last line of each hit in the list of results. Click on this and you can access the page version last indexed by Google. The white information box at the top of the page shows this is the cached version and it tells you when the page was indexed.
Your search terms will also be highlighted in colour through the whole document. A really quick way to search for other terms in the cached version of pages is to add a term directly into your browser’s url alongside your other search terms. You will also need to add a plus symbol before the term. Press return and that new term will also be highlighted throughout the document.
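The plus symbol is needed because a space inside a url’s query string is conventionally written as ‘+’. The Python sketch below shows the same edit done programmatically. It is an illustration only: the address `search.example` and the parameter name `q` are hypothetical stand-ins, not the real address of any cached page.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_term(url, term, param="q"):
    """Append one more search term to the named query parameter of a url."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Add the new term, separated by a space, to the search parameter only
    updated = [(key, f"{value} {term}" if key == param else value)
               for key, value in pairs]
    # urlencode writes each space back out as a plus symbol
    return urlunsplit(parts._replace(query=urlencode(updated)))


# Hypothetical address, used purely for illustration
print(add_term("http://search.example/cache?q=norovirus+outbreak", "litigation"))
```

The printed url ends in `q=norovirus+outbreak+litigation`, which is what you would get by typing ‘+litigation’ after the existing terms by hand.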
There are two other good reasons why you may want to use the cache. Firstly, if a website owner pulls a site page they want to hide you may still be able to access the information through the cache. Secondly, using the cache is another good way to surf anonymously if you don’t want to alert the site about your visit.
8 – Feed engines
Don’t forget that sites such as Technorati, Blogger, Britblog, Blogpulse and Feedster allow you to search for specific blogs, subjects within blog posts and news feeds. Once you’ve found what you’re looking for you can sign up to the feed. This is another way of monitoring key subject areas. Blogpulse plots selected terms on a graph showing the trend in how often these terms appear in specific feeds. Follow the link to Trend Search for that tool.
9 – Search engine limitations
Key to understanding how search engines can help is knowing their limitations. You may do better to check the lie of the land using a specialist subject directory than to cast around in a search engine or even in a meta-search engine. Subject directories include Yahoo! Directory, the UK focused BUBL LINK and the Librarian’s Index to the Internet.
Remember also that search engines have indexed only a small part of the web. Many sites, including newspapers, block search engine crawlers. The restricted access material on those sites is never indexed. Furthermore, search engines don’t index every page on the sites they crawl, they don’t have access to most database sites and many crawlers can’t access some file types. Strategies to access this ‘invisible’ or ‘deep’ web are beyond the scope of this article, but a useful introduction is here.
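One common way for a site to turn crawlers away is a robots.txt file, although this article does not say which method any particular site uses. As a hedged illustration, the Python sketch below uses the standard library’s `urllib.robotparser` to check which pages a well-behaved crawler would skip. The rules, the crawler name ‘ExampleBot’ and the `news.example` addresses are all invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt rules: every crawler is asked to stay out of /archive/
ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/
"""


def crawler_may_fetch(robots_txt, user_agent, url):
    """Report whether the given rules allow this crawler to fetch the url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical crawler name and addresses
print(crawler_may_fetch(ROBOTS_TXT, "ExampleBot",
                        "http://news.example/archive/2007/story.html"))
print(crawler_may_fetch(ROBOTS_TXT, "ExampleBot",
                        "http://news.example/today.html"))
```

Under these invented rules the archive page is off limits and the front page is not, which is why archive material on such a site would be missing from a search engine’s index.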
Another problem is that some search engines are clearly better than others and the algorithms they use to rank results are opaque and are often the subject of furious debate. A little known fact is that some engines include ‘sponsored’ results within their main results list or content (unlike Google and alltheweb which list them in a separate column or box). If you search for ‘airlines’ then you’d probably expect to see a lot of sponsored links using most search engines. But would you expect to see so many sponsored links after running the search ‘skin cancer’? Run that search in Dogpile and you’ll see that the majority of the results are ‘sponsored’ in the main list of hits.
Finally, search engines are mutating and adapting. As this article demonstrates, some search engines do some things better than Google and new services appear all the time. Snap, for example, gives access to page previews. This gives you a much better grasp of what is relevant, and that one service leaves Google standing.
There are a host of sites dedicated to search engines and new developments. One of the best is Search Engine Detective by Pandia. Using this you can search for key terms (such as search engine names) in all good sites about search engines or among the most influential 25. Use this to get information about key new search engine tools and services.
• Colin Meek will be running a one-day course on Advanced Online Research in London on 16 May 2007. Places are limited so early booking is advised. Click here for more details.