The great strength of the Web is that anyone can contribute to it. And the great weakness of the Web is that anyone can contribute to it.
The weakness part of that equation is dominating the headlines at the moment, as everybody and their little cat Francis has launched some kind of AI-writing tool. Write with no effort, the ads say. Create SEO-friendly content! Make big bucks with targeted advertising!
Drown the Internet in bilge!
The shallow, ill-informed junk that AI is spewing into the Web – what I call infosewage – is bad for people who are trying to perform useful Web searches. Unfortunately, it’s good for low-effort endeavors to make money off advertising, so it’s not going anywhere until online advertising companies make it impossible to profit from uncredited, rechewed content. I’m not holding my breath.
What I am doing is developing tools to avoid the infosewage. You don’t have to create an opaque algorithm or develop your own specific set of sites to find useful content on the Internet. Instead, you can take advantage of the authoritative structures that are all around us and use them to guide you to Web spaces that contain much more information and much less crud.
In this article, I’m going to show you nine different search tools that you can use to help you keep the search result garbage to a minimum and perform more useful queries using the principle of authoritative structures. Let’s start with that idea: the authoritative structure.
What’s an Authoritative Structure?
When I talk about authority in this context, I just mean a person/group/institution that has rights and powers that most people don’t. The FCC, for example, issues licenses for TV stations in America. Nobody else can do that. You can use the FCC to confirm that a Web site belongs to a TV station because the FCC maintains a license database (an authoritative structure.)
Another example of an authoritative structure is the EDU top-level domain. Only institutions of higher education are allowed to use that domain – you and I can’t go to a domain registrar and register SomeRandomWebsite.edu. (UPDATE: A reader points out that .edu domains can be used by institutions affiliated with education (like the Smithsonian) and in some cases even by institutions focused on K-12 education. These uses of the edu TLD are uncommon, but they do exist, and I should note them.) The same goes for other top-level domains like MIL and GOV – their use is restricted to specific groups.
Just because a website is created in or aggregated around an authoritative structure doesn’t mean that every word on that site is going to be correct and true. What it does mean is that there is a barrier to entry for that site which makes it more difficult for infosewage to infest it.
(If you’re interested in reading more on authority and authoritative structures in Web search, I have fully nerded out on the topic here.)
Is Wikipedia an Authoritative Structure?
Wikipedia is open for pretty much anybody to edit, so it’s not an authoritative structure the way I define it in this article. On the other hand, Wikipedia has a strong framework for requiring people to be responsible for their edits. When you find information on Wikipedia, you can trace it back to when it originated and to which user. Because of that, Wikipedia has more transparency than the Web at large (though not more authority.) I use Wikipedia to guide my searches and take me quickly to reference-type information (like social media contacts, biographical data, etc.) but I don’t lean on it like I would the FCC’s licensing database.
Let’s start with news and web search.
Finding More Search Results, Less Sewage
Junk in web and news search results is nothing new. I was complaining about it in 2019. I was noting drug spam in 2017. What’s new is that allegedly-credible news sites, like CNET and Gizmodo, have been using AI to generate stories that have errors. Other unreliable “news” sites are showing up with AI-generated content.
There are three strategies you could use to avoid this kind of material. The first is to limit your search to a set of sites you found in an authoritative structure. The second is to choose from a specified, easily-adjustable set of sources to search, so you know where your information is coming from and can change your sources quickly. And the third way is to use external information to define your search in a way that an infosewage-generator can’t emulate. Let’s look at tools for each strategy.
Strategy 1: Limit to Authoritative/Confirmed Sites
Marion’s Monocle – https://searchgizmos.com/mm2/
Marion’s Monocle is named after Marion Stokes, who recorded television news for over 35 years and ended up with a huge collection of over 70,000 video tapes. Marion’s Monocle uses the FCC licensing database to identify American television stations by city / state, then aggregates the stations you choose into a Google or Google News search using Google’s site: syntax. You can choose to search for recent stories indexed by Google News or you can search the stations’ web space more generally.
Once you’ve generated the Google or Google News site: search, you can add and refine query terms.
Because of Google’s search limitations, Marion’s Monocle searches only up to 10 station websites at a time, and besides that there’s a limited number of television stations in America anyway – maybe 1700-1800? So while this tool won’t bring you tons of search results, the results you get will be from sites certified by the FCC to belong to television stations. You’ll know they’re real and you’ll know they are located where they say they are located.
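Under the hood, the aggregation step is mostly string assembly. Here’s a minimal sketch in Python – the station domains and the helper function are my own illustration, not the Gizmo’s actual code:

```python
from urllib.parse import quote_plus

def build_site_search(domains, query, news=True, max_sites=10):
    """Bundle up to max_sites domains into a single Google search URL,
    using the site: operator OR'd together. Google tends to ignore
    operators past a certain query length, hence the 10-site cap."""
    clause = " OR ".join(f"site:{d}" for d in domains[:max_sites])
    base = ("https://news.google.com/search?q=" if news
            else "https://www.google.com/search?q=")
    return base + quote_plus(f"{query} {clause}")

# Hypothetical FCC-confirmed station domains for one market:
stations = ["wxyz-example.com", "wabc-example.com", "wdef-example.com"]
url = build_site_search(stations, "school board")
```

Once the URL is built, refining the query is just a matter of editing the search terms in the Google or Google News search box.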
If you want to do a more general search that is still based in authority, try Super Edu Search.
Super Edu Search – https://searchgizmos.com/super-edu-search/
As I noted earlier in this article, EDU top-level domains aren’t available to just anybody. You have to be a certified institution of learning. That makes .edu sites less susceptible to infosewage (though they can still be hacked or play host to public forums where people post nonsense.)
It’s easy to add site:edu to a Google search, but it’s so limited! You can’t focus your searches by geography or religion or other demographics. I really hated that so I made Super Edu Search. Super Edu Search uses a dataset from Data.gov to let you choose, very specifically, what kind of site:edu searches you want to make. Maybe you want to search all the public universities in Colorado or all the HBCUs in North Carolina. No problem!
Once you run the search you’ll often find that your results tend to be location-oriented, though that will vary depending on the kinds of parameters you chose to use.
Because .edu websites have LOTS of content aggregated by LOTS of people in a way that isn’t always clearly traceable, due diligence is still required when reviewing information from edu sites. But restricting your search to edu sites will remove some infosewage, and using Super Edu Search to refine it further will hopefully make reviewing the content you find less onerous.
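If you’re curious what that dataset-driven narrowing looks like, here’s a rough sketch. The rows and field names below are invented for illustration – the actual Data.gov dataset has its own schema:

```python
# Hypothetical rows in the shape of a postsecondary-schools dataset;
# the real Data.gov field names may differ.
schools = [
    {"name": "Example State University", "state": "CO",
     "control": "Public", "domain": "exstate.edu"},
    {"name": "Example Private College", "state": "CO",
     "control": "Private", "domain": "explc.edu"},
    {"name": "Another University", "state": "NC",
     "control": "Public", "domain": "anotheru.edu"},
]

def edu_site_clauses(rows, state=None, control=None):
    """Narrow the dataset to matching schools and return one site:
    clause per school, ready to OR into a Google search."""
    picked = [r for r in rows
              if (state is None or r["state"] == state)
              and (control is None or r["control"] == control)]
    return [f"site:{r['domain']}" for r in picked]

clauses = edu_site_clauses(schools, state="CO", control="Public")
```

The point is that the filtering happens against an external dataset first, so the resulting site: search only touches the webspaces you actually chose.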
The reason these tools help keep garbage out of your search results is that you’re searching websites that are restricted in various ways. Sometimes those restrictions don’t work, though. Sometimes they stymie your searches because you just can’t find a webspace relevant to your search topics. In that case, move to strategy 2: “Choose from a specified, easily-adjustable set of sources to search, so you know where your information is coming from and can change your sources quickly.” Wikipedia makes this easy!
Strategy 2: Generate a Source List
You want to search the news. You want to search more than TV stations. What to do? Get a news source list from Wikipedia and build your searches from that!
I’ve made two versions of Non-Sketchy News Search but they do basically the same thing: find news sources on Wikipedia and turn them into a search on Google. Let me show you how each of them works.
Non-Sketchy News Search v1 – https://searchgizmos.com/nsns/
I made the first version of Non-Sketchy News Search when I didn’t really know what I was doing, so it has kind of a bizarre search mechanism. Instead of keyword searching to find news sources, you search for a source that already exists to find all the other sources in that category.
In this case, I searched for Durham Herald Sun, which provided me with a list of categories in which that newspaper appears. When I pick a category I can choose from those sources to bundle into a Google search.
Because you have to know the name of the outlet you’re looking for, you can’t use this version of NSNS for areas you’re unfamiliar with. I’ve had some success just entering paper names and exploring categories. Searching for The Japan Times, for example, found me a whole category of English-language newspapers in Japan.
The second version of NSNS is happily less complicated.
Non-Sketchy News Search v2 – https://searchgizmos.com/nsns2/
Non-Sketchy News Search v2 works much more sensibly. Enter a keyword with which you want to search for publications and specify a Web search query. NSNS will return a list of news sources matching your keyword, along with additional information and a URL in case you want to visit it before you search.
When you use Wikipedia to create a list of news sources to search, you’re not accessing the authority of something like the FCC license database or the exclusivity of an .edu domain. On the other hand, Wikipedia offers a history of its articles and links. If you question a news source, you can check its history in Wikipedia. Who wrote it? How has it been edited? Is it an old source, or very new? If you find the answers unsatisfactory it’s easy to remove a source from your search and replace it with another one.
The first two strategies I talked about to avoid infosewage while searching the web dealt with building source lists. This last strategy focuses instead on directing your search in a way that can’t be copied or countered by infosewage. You can do that with a fun Gizmo called Gossip Machine (which also has two versions.) Please note that Gossip Machine only works with topics covered by Wikipedia, so its usefulness is limited to that (pretty big) sphere.
Strategy 3: Guide Your Search With External Data
Say you wanted to do a web search for a famous person, like American musician Lady Gaga. If you just searched her name you’d get an ocean of results. Even if you limited your search to Google News you would still get tons of results, and the usefulness would be all over the place.
But what if you had a way to pinpoint when Lady Gaga was attracting an unusual amount of attention? What if you had a way to discover days that would be more likely to have relevant news about her?
You do, via Wikipedia’s page views. Wikipedia’s page views, which are available back to 2016 via the Wikipedia API, can be thought of as fossilized attention – a dataset to mine for times when Wikipedia pages had especially high views. Once you determine those times, you can use them to create a pinpointed, date-based Google or Google News search that will take you, hopefully, to lots of useful news stories and a minimum of sewage. That’s what Gossip Machine is all about.
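For the curious, the page view data comes from a public Wikimedia REST endpoint. Here’s a sketch of building that request – the helper name is my own, and you should check the API documentation before relying on the exact path:

```python
def pageviews_url(title, start, end, project="en.wikipedia"):
    """Build a Wikimedia REST API request for an article's daily page
    views. start and end are YYYYMMDD strings; article titles use
    underscores instead of spaces."""
    article = title.replace(" ", "_")
    return ("https://wikimedia.org/api/rest_v1/metrics/pageviews/"
            f"per-article/{project}/all-access/all-agents/"
            f"{article}/daily/{start}/{end}")

url = pageviews_url("Lady Gaga", "20220301", "20220331")
```

The endpoint returns a JSON list of per-day view counts, which is exactly the raw material Gossip Machine mines for attention spikes.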
Like Non-Sketchy News Search, Gossip Machine has two versions. The first version searches by year. The second version searches by month.
(If you’re interested in another full nerding-out about a search topic, I wrote about using popularity and expertise to guide search here.)
Gossip Machine v1 – https://searchgizmos.com/gossip-machine/
The first version of Gossip Machine has you specify a Wikipedia page title, a year, and whether you want to get more or fewer results. And few means few – a search for Lady Gaga in 2022 found only three results.
Now of course Ms. Gaga had much more going on in 2022 than three dates would indicate. These are the dates that Gossip Machine found most important based on Wikipedia’s page views. Ideally these dates are information-rich and focused on Lady Gaga. If you try the Google News search for March 28, you’ll see that this is indeed the case:
The first version of Gossip Machine is a little janky because it’s one of the first Gizmos I made. I redid it this year with Gossip Machine 2, which takes a monthly approach to the search.
Gossip Machine v2 – https://searchgizmos.com/gm2/
The second version of Gossip Machine searches a month at a time instead of a year at a time, and it provides much better visual indicators of how unusually high a returned day’s page views are. Searching for Lady Gaga in March 2022 finds two busy days, March 28 and March 29. The red bar shows the Z-score – how much busier that day’s page views were than the mean for the month.
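The Z-score calculation itself is simple statistics. A sketch of the spike-detection idea, with made-up view counts standing in for real API data:

```python
from statistics import mean, stdev

def spike_days(daily_views, threshold=1.0):
    """Return (day, z-score) pairs for days whose page views exceed
    the period's mean by at least `threshold` standard deviations."""
    counts = list(daily_views.values())
    mu, sigma = mean(counts), stdev(counts)
    return [(day, round((n - mu) / sigma, 2))
            for day, n in daily_views.items()
            if (n - mu) / sigma >= threshold]

# Illustrative counts, not real Lady Gaga page views:
views = {"2022-03-20": 10000, "2022-03-21": 11000, "2022-03-22": 12000,
         "2022-03-23": 10500, "2022-03-24": 11500, "2022-03-25": 12500,
         "2022-03-26": 10000, "2022-03-27": 11000, "2022-03-28": 95000,
         "2022-03-29": 60000}
hot = spike_days(views)
```

With numbers like these, the two spike days stand well above the month’s mean while the ordinary days fall below it, so only the spikes survive the threshold.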
Was that March 29 date also full of Gaga News? Absolutely.
Gossip Machine isn’t perfect – it’s limited to Wikipedia topics and it only goes back to 2016. Further, if a Wikipedia page gets relatively few page views – 1000 a day or so – Gossip Machine can toss up wonky results. But when you’re searching a very popular topic or person and you need to cut through a lot of search noise, this Gizmo can really work some magic.
All the searching I’ve been talking about so far has been on the in-depth side – searching you would do when you wanted to explore or flesh out your knowledge of a topic.
But that isn’t the only kind of searching you do on a daily basis. There are also the quick-hit searches you do – when you need a company’s official social media site, for example, or some biographical data. Infosewage can hinder those types of searches or, in the worst scenarios, redirect them in a harmful way.
Wikipedia collects a lot of little reference points about each article’s subject via the Wikidata structured data service. Wikidata contains a lot of information that you’d normally find via a quick web search, but it’s not easily accessible via Wikipedia’s regular page structure, especially in bulk. So I made a few tools for getting this data in a way that lets you skip those Google reference searches entirely. Let’s talk about pulling data from Wikipedia with MegaGladys, RoloWiki, and Sheet-Shaped Wikipedia.
Dodging Infosewage With Wikipedia Gizmos
MegaGladys – https://searchgizmos.com/megagladys/
Need to know the URL for Lady Gaga’s SoundCloud? Instead of searching Google, check in with MegaGladys. MegaGladys (named after Gladys Kravitz on the old Bewitched TV show) aggregates information about Wikipedia topics and presents it to you in a list. Search for Lady Gaga and you’ll get a picture, brief biographical information, and a list of official sites. There’s also reference information from credible sources like the Library of Congress and WorldCat. (How much information is listed depends on the topic you search for.) Finally, a cut-down version of Gossip Machine shows possibly-newsworthy dates for the topic in the last couple of years.
MegaGladys is a good way to quickly find official site information for a famous person or institution without having to wade through search result junk. But you have to know exactly who you’re searching for. If you’re looking more for information about a general category of people – CEOs of tech companies, for example, or active politicians in Virginia – you’ll find RoloWiki more useful.
RoloWiki – https://searchgizmos.com/rolowiki/
RoloWiki lets you specify a Wikipedia page and then shows you the content of that page. The difference is that the internal links to other Wikipedia pages are replaced with a function call that extracts a predetermined list of available Wikidata properties for that link (first and last name, date of birth, occupation, official website, Library of Congress reference ID, Wikimedia Commons category, LinkedIn ID, Facebook account, and Twitter account.) The area in which the Wikidata properties appear stays in a fixed position so you can easily browse the entire Wikipedia page, and clickable links in the overlay open in a new tab.
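The Wikidata lookup itself boils down to walking an entity’s claims. Here’s a sketch of that extraction using two real property IDs (P856 is “official website,” P2002 is “Twitter username”) and a made-up entity record in the general shape the wbgetentities API returns – not actual API output:

```python
PROPS = {"P856": "official website", "P2002": "Twitter username"}

def extract_props(entity, props=PROPS):
    """Walk a Wikidata entity's claims and return {label: first value}
    for each property of interest that the entity actually has."""
    out = {}
    for pid, label in props.items():
        claims = entity.get("claims", {}).get(pid, [])
        if claims:
            snak = claims[0].get("mainsnak", {})
            out[label] = snak.get("datavalue", {}).get("value")
    return out

# Illustrative entity record, shaped like wbgetentities output:
sample = {"claims": {
    "P856": [{"mainsnak": {"datavalue":
              {"value": "https://www.example.com"}}}],
    "P2002": [{"mainsnak": {"datavalue": {"value": "example"}}}],
}}
info = extract_props(sample)
```

Because missing properties are simply skipped, the same extraction works whether a topic has two properties filled in or twenty.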
This means you can do something like load Wikipedia’s page of current United States senators and get contact/official outlet information for each name you click.
RoloWiki works best for people, but you can also get plenty of information for locations, businesses, and institutions. How much information is available depends on the topic; I find that famous people and politicians have the most, and for everyone else it varies.
MegaGladys provides a way to get a lot of official information about one Wikipedia topic, while RoloWiki gives you a way to browse for contact information across a variety of Wikipedia pages. But what if you want to actually download the data and do something else with it? For that you want Sheet-Shaped Wikipedia.
Sheet-Shaped Wikipedia – https://searchgizmos.com/ssw/
Sometimes you do not want to browse. Sometimes you want to hoover up data and go about your business. Sheet-Shaped Wikipedia is here to help. It works by category: supply a list of categories from which you want to extract data and choose from a list of Wikidata properties you want to extract (there are 17 available, consisting of the official website property and a lot of social media properties.) SSW will go through each page in the category, extract the properties of your choice, and save them to a caret-delimited text file. You can choose to keep one text file for each category, or merge all categories into a single text file. Sheet-Shaped Wikipedia gathers up lots of social media and website information without having to do any kind of Web search!
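The output step is straightforward once the properties are gathered. A sketch, with made-up rows standing in for extracted Wikidata properties:

```python
# Hypothetical extracted rows; the real tool pulls these from Wikidata.
rows = [
    {"page": "Example Senator",
     "official website": "https://example-sen.example.gov"},
    {"page": "Another Senator",
     "official website": "https://another-sen.example.gov"},
]

def to_caret_delimited(rows, fields):
    """Render rows as caret-delimited lines, header first. The caret is
    a handy delimiter because it almost never appears in names or URLs,
    unlike commas."""
    lines = ["^".join(fields)]
    for row in rows:
        lines.append("^".join(str(row.get(f, "")) for f in fields))
    return "\n".join(lines)

text = to_caret_delimited(rows, ["page", "official website"])
```

A file like this opens cleanly in any spreadsheet program once you set the delimiter to the caret character.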
The amount of data the world generates every day is staggering. The amount of garbage we have to wade through to get to it is also staggering. Still, there are ways we can use existing authoritative structures to create and define web spaces that are less vulnerable to the rising tide of infosewage. I hope these tools are useful to you.
Categories: Learning Search