NANFWRIMO #4 & #5: Strategies for Low-Stress Web Monitoring

Note: this article is so long that I’m counting it as two days’ worth of NANFWRIMO. Next article on Monday.

The Web is enormous. Billions and billions of pages, if I can get all Carl Sagan on you, with more being added every second. When you’re trying to monitor these new Web pages for information, the potential flood of data can be daunting. So many pages, irrelevant to your topic but glomming on to your search terms! So much spam! How can you possibly find the good stuff amidst all the dreck?

The truth is this: the Web is still great for finding new and useful information. With the increase in its size, however, and the amount of gaming and spamming going on, you have to think strategically about how you’re going to monitor its new pages. You can’t just put a couple of keywords in a Google Alert; you’ve got to be a bit more savvy. That’s what this article’s about: seven tactics to give you a rich flow of data from new Web pages without wasting a lot of time and tears on off-topic pages, junk, and spam.

For the purposes of this article I’m using Google Alerts. It has its shortcomings, but in terms of completeness I think it’s the best option out there for Web monitoring. (If you have alternative suggestions, please put them in the comments!) If you’ve got a Google Account, you’ve got access to Google Alerts – you can access it at https://www.google.com/alerts and it’s free.

(If you’ve never used Google Alerts before and need an overview, WikiHow can help you out.)

1. Choose Your Search Terms Wisely

If your search terms are off, Google Alerts isn’t going to do much for you, so this first tip is the most important: Get your search terms right. Your goal here is a balance between the very specific terms that would get you exactly what you want but would very rarely come up in a Web page, and the broader terms that would get you interesting results but also a lot of uselessness to eliminate.

Think about your topic terms. Write them out. Which ones do you see a lot in the resources you review? Which ones do you rarely see? Google Alerts will give you a preview of results when you set up an alert. Test your terms. Which ones are giving you resource-rich, “meaty” results? Which ones are junk?

Sometimes it’s difficult to narrow down your search terms. For example, I want to learn about new digital archives. “Digital archive” is a general term, but at the same time it would be difficult for me to narrow my focus and still get a good spectrum of results. I can use more obscure terms, like “digital library,” but that’s more changing vocabulary than getting specific. If you really can’t think of a way to create pinpoint search terms for your topic, don’t worry; we’ll look at other ways to narrow your search in this article.

Even if you can’t come up with focused search terms for your topic of interest, see what you can do by adding time-related words to your query. For example, if I set up a Google Alert for “search engine”, I would spend hours going through junk results. But what about “new search engine”? Much better results. Fewer of them and they’ll be more timely. Maybe recent, updated, or latest can be integrated into your search terms and phrases.
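If you want to try several time-related words against the same phrase, it can help to generate the variations first and then test each one in the Google Alerts preview. Here is a minimal Python sketch of that idea. The function name time_variants and the default word list are my own illustration, not anything Google provides.

```python
def time_variants(phrase, modifiers=("new", "recent", "updated", "latest")):
    """Return one quoted search phrase per time-related modifier."""
    phrase = phrase.strip()
    return [f'"{modifier} {phrase}"' for modifier in modifiers]


# Print each variation so it can be pasted into the alert preview.
for query in time_variants("search engine"):
    print(query)
```

Not every modifier will suit every topic, so treat the output as candidates to test rather than finished alerts.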

2. Limit by Domain

Suppose you just can’t limit your search terms: your topic is either very broad or defies specific description. That’s okay. Shift your attention to limiting where Google Alerts will find results.

Because my interests are in digital archives, online databases, etc., I find that focusing on certain top-level domains gives me quality results in workable quantities. Like this:

"digital archive" (site:edu | site:gov | site:mil | site:museum | site:nc.us)

(Google does not need the parens to correctly parse this search. But it helps me organize my thinking and, in the cases where I’m doing very complex searches, helps me understand them when I go back and review months after creation.)

In this case I’m doing the fairly broad search for digital archive but limiting the results to only a few top-level domains, including .edu, .gov, and municipal/government sites in the state of North Carolina. This gives me a manageable flow of results.
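If you reuse the same list of domains across many alerts, you can assemble the query string with a few lines of Python instead of retyping it. This is only a sketch: site_group and limited_query are hypothetical helper names, and the output is plain text to paste into Google Alerts.

```python
def site_group(domains):
    """Join domains into a parenthesized group of site: filters separated by |."""
    return "(" + " | ".join(f"site:{domain}" for domain in domains) + ")"


def limited_query(phrase, domains):
    """Quote the phrase and append the group of site: filters."""
    return f'"{phrase}" {site_group(domains)}'


# Rebuild the alert shown above from its parts.
print(limited_query("digital archive", ["edu", "gov", "mil", "museum", "nc.us"]))
```

Keeping the domain list in one place makes it easier to add or drop a domain later without editing every alert by hand.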

Protip: When I limit my Web monitoring to only a few top-level domains, I do it with the knowledge that I am going to miss resources made available on .com and .org sites as well as other domains. I feel comfortable doing that because I also monitor news, social media, and RSS feeds. You must monitor multiple aspects of the Internet because some tools and terms will work better than others.

3. Limit by Area of Page Monitored

The messy HTML of a Web page is not as delineated as, say, an XML file. But it still has an identifiable title. When you’re frustrated with too many useless results from a Google Alert, give it a laser-like focus by searching only the titles of pages:

intitle:"new search engine"

You have instantly cut down the data pool that Google Alerts is searching by at least 90% when you’re monitoring only Web page titles.

Now, can you combine the ideas of narrowing your search by domain AND by page title? Yes. In fact, I’ve found this a great way to monitor Reddit. Visiting the site and browsing, even using the search engine, takes a lot of time and doesn’t find much. But this search in my Google Alerts generates a trickle of great resources I rarely find anywhere else:

intitle:"new tool" site:reddit.com

Or maybe I’ve got a keyword that’s kind of general so I want to limit my search to just blogs:

intitle:"new web" (site:blogger.com | site:wordpress.com | site:typepad.com | site:livejournal.com)

Do you see what you’re doing here? You’re using special syntax to trim the billions-of-pages Web into manageable chunks. But you don’t even need special syntax to do that; you can also limit your results by excluding keywords.
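The title-plus-site combinations above follow one pattern, which can be sketched in Python. The function title_query is a hypothetical name for illustration. It writes a single site without parentheses and groups several sites with |, matching the two examples above.

```python
def title_query(title_phrase, domains=()):
    """Build an intitle: search, optionally restricted to one or more sites."""
    query = f'intitle:"{title_phrase}"'
    domains = list(domains)
    if len(domains) == 1:
        # A single site needs no grouping.
        query += f" site:{domains[0]}"
    elif domains:
        # Several sites are joined with | inside parentheses for readability.
        query += " (" + " | ".join(f"site:{d}" for d in domains) + ")"
    return query


print(title_query("new tool", ["reddit.com"]))
print(title_query("new web", ["blogger.com", "wordpress.com", "typepad.com", "livejournal.com"]))
```

Building the string this way also avoids stray spaces after site:, which are easy to introduce when typing long queries.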

4. Your Exclusions are As Important As Your Inclusions

When it comes to Web searching this isn’t said often enough: What you exclude is just as important as what you include. If you’re trying to monitor the Web and you keep getting the same old junk, consider that maybe your included search terms are fine and you need to exclude some words from your alerts.

I want to monitor news releases at certain top-level domains – basically I’m looking for press releases. At the same time I don’t want to get re-indexed archive pages, or job announcements. So I use this:

intitle:news release press contact -intitle:"blog archive" -inurl:jobs -intitle:"job vacancies" (site:edu | site:gov | site:mil)

This gets me what are generally resource announcements and skips job appointments and personnel-type stuff. When you find your Google Alerts are getting you useless results that are of a specific type, go through them and see if there are any keywords that you can use to exclude that class of results entirely.
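One way to spot exclusion keywords is to collect the titles of the useless results and count which words turn up in the most of them. The sketch below assumes you have copied those titles into a list by hand. The sample titles are made up for illustration, and exclusion_candidates is a hypothetical helper name.

```python
import re
from collections import Counter


def exclusion_candidates(junk_titles, already_searching=(), top=5):
    """Return the words found in the most junk titles, skipping words already in the query."""
    skip = {word.lower() for word in already_searching}
    counts = Counter()
    for title in junk_titles:
        # Count each word once per title, so one repetitive title cannot dominate.
        words = set(re.findall(r"[a-z]+", title.lower()))
        counts.update(w for w in words if len(w) > 2 and w not in skip)
    return counts.most_common(top)


# Made-up titles standing in for results you decided were junk.
junk = [
    "Job Vacancies - News Release Archive",
    "Blog Archive: Press Contact List",
    "Current Job Vacancies | Human Resources",
]
print(exclusion_candidates(junk, already_searching=["news", "release", "press", "contact"], top=3))
```

A word that ranks high here is only a candidate. Check that excluding it would not also remove results you want before adding it to the alert with a minus sign.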

Speaking of special syntax and language, did you notice I used the special syntax inurl in that last example?

5. Inurl: Is Tricky But Useful

The inurl syntax searches for a character string in a page URL. Using it judiciously can mean short searches which yield information-rich results:

intitle:"digital archive" inurl:library site:edu

But this syntax requires caution. I can use inurl:library in the search above because many university libraries delineate their sites this way. It’s not quite standard, but it’s common enough that it works. You may find that your inurl searches are not standard or common enough, and end up eliminating a lot of resources you’d otherwise find. You’ll have to experiment.
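Part of that experimenting can be done before you change an alert. Take a handful of pages you already know are good and check how many of their URLs contain the string you plan to use with inurl. The Python sketch below uses a plain case-insensitive substring test, which is my assumption about roughly how inurl behaves, not a description of Google's actual matching. The URLs are invented placeholders.

```python
def inurl_coverage(urls, fragment):
    """Return the share of URLs that contain the fragment, ignoring case."""
    urls = list(urls)
    if not urls:
        return 0.0
    hits = sum(1 for url in urls if fragment.lower() in url.lower())
    return hits / len(urls)


# Placeholder URLs standing in for pages you already know are useful.
known_good = [
    "https://library.example.edu/digital-archive",
    "https://www.example.edu/library/collections",
    "https://archives.example.edu/digital",
    "https://www.example.edu/special-collections/archive",
]
share = inurl_coverage(known_good, "library")
print(f"{share:.0%} of known-good pages contain 'library' in the URL")
```

If the share is low, the inurl filter would probably hide many of the resources you care about, and a broader alert may serve you better.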

Of course, sometimes your keywords are just too general for the Web, and you have to take strong measures to winnow down the data pool you’re monitoring.

6. Shift Your Focus to a Smaller Data Pool

Google Alerts monitors not only the Web but also news, blogs, what it calls “discussions,” and a few other Internet subsets. If you find, even after experimenting and revising your search, that you’re still getting too many non-useful results, consider shifting your Google Web alert to a Google News alert. You won’t get as many results, but they will be more focused and generally more timely.

Protip: Generally I let my Google Alerts monitor everything – News AND blogs AND Web and so on. It’s best to let these alerts run together. However, if you find that you’ve got a search term that attracts a lot of spam or off-topic results, setting your Google Alerts to bring your results only from News sites will generally get that topic back on track.

7. Regularly Revise and Update

Did you read Harry Potter and the Goblet of Fire? Let me channel Alastor Moody for a moment:

“CONSTANT VIGILANCE!”

The searches you ran in 2001 probably did not look like the searches you ran in 2010, which in turn probably did not look like the searches you run now. Even if your topics of interest don’t change, the resources you would use online to access and discuss them will change. Make sure that you’re regularly going through your alerts to see if you’re using keywords that are outdated and will limit your results.

At the same time, check to see if you might need to add some keywords. Since I cover a lot of Internet resources in ResearchBuzz, I want to be informed about new articles, tweaks, and updates about a number of tools that have been around for less than a couple of years.

When livestreaming became a big deal, I made sure I was covered in my Google Alerts:

(intitle:periscope | intitle:snapchat | intitle:meerkat | intitle:vine) (site:edu | site:gov | site:mil)

When I discovered I was finding good content about new Facebook extensions buried in pages about other things, I made sure that topic got more prominence:

"new facebook tool"

When I started getting more and more results from Academia.edu, I decided to break those results off into their own alert:

("digital archive" | "online database") site:academia.edu

At the same time I might stop monitoring other topics. Bookmarking, for example, is not the topic it used to be. I’ve noticed that there’s less talk about online museums and more talk about digital archives. Digital library is another search term that seems to have gained in prominence lately, if my Google Alert results are any indication.

When you monitor topics regularly over a long period of time you will get a familiarity that’s not quite conscious. You’ll notice patterns. Something will bug you and you’ll realize you haven’t seen a particular keyword mentioned in quite a while. Another time you’ll see someone’s name and realize you’re seeing that name affiliated with a certain topic an awful lot — even though that name is not particularly famous or talked about. Use this developed knowledge — I’ve called it “spidey sense” — to refine your Google Alerts.
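If you would rather not rely on memory alone, you can keep a simple record of when each alert last produced something useful and review the quiet ones. The Python sketch below assumes you maintain that record yourself. Google Alerts does not supply it as far as I know, and the dates, the 90-day threshold, and the name stale_alerts are all illustrative.

```python
from datetime import date, timedelta


def stale_alerts(last_useful_hit, today, max_quiet_days=90):
    """Return the alerts whose last useful result is older than the cutoff."""
    cutoff = today - timedelta(days=max_quiet_days)
    return sorted(query for query, last in last_useful_hit.items() if last < cutoff)


# Hand-kept record: alert query -> date of its last useful result (made-up dates).
record = {
    '"digital archive"': date(2016, 3, 1),
    '"online museum"': date(2015, 9, 15),
    "bookmarking": date(2015, 6, 2),
}
for query in stale_alerts(record, today=date(2016, 3, 10)):
    print("Review or retire:", query)
```

A quiet alert is not automatically a dead one, so use the list as a prompt to review the keywords rather than as a rule for deleting them.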

The size of the Web can be intimidating. Use the tools Google gives you — special syntax, the ability to exclude words, and restricting your search to only certain areas of a page — to get your results down to a usable level that lets you spend more time using what you find, and less time wading through page after page of useless results.

Categories: Learning Search, News, Rants

7 replies »

  1. I used to work for a company that did… something similar to automated Google alerts. Pick examples of the documents (web pages, etc.) that you were interested in, and we would create a profile based on an analysis of your selections.

    I found a Google Alerts creator (http://www.hudsonatwell.co/downloads/google-alert-creator-3-0/) that seems to handle the “create an alert”. Lacks the analysis portion, though.

  2. This is 100% awesome. Thank you!

    (It’s also excellent, btw, just as a guide to using Google syntax in general. I can’t tell you how many people have looked at me like I was nuts when I tried to tell them about inurl, intitle, etc.)
