Introduction
A comprehensive database covering the papers and books in a particular field is of great value for subsequent research, including:
Laying a good foundation to facilitate subsequent analysis.
Constructing a foundational knowledge structure about the field. This entails acquiring a fundamental understanding of the key concepts, theories, and principles that underlie the subject matter.
Conducting exploratory data analysis (EDA) to gain initial insights into the field. By performing EDA, researchers can identify prominent authors, discern publication trends, identify highly influential papers, and determine the most frequently cited journals in the field. These findings provide a preliminary grasp of the field's landscape and its significant contributors.
Collecting all related papers in a specific field manually can be a laborious task. Fortunately, there are open-source programs available that can automate this process. One popular tool is Scholarly, which has gained over 1.2k stars on GitHub. Scholarly is a Python-based program that enables users to retrieve author and publication information from Google Scholar in a simple, Pythonic way, without having to solve CAPTCHAs. However, Scholarly has one drawback: the scholarly.search_pubs() function requires proxies to be set up, which is not very user-friendly, especially for new users unfamiliar with proxy configuration.
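For reference, a minimal sketch of what a Scholarly-based search looks like, based on the library's documented API (the FreeProxies generator is just one of several proxy back-ends it offers, and the query string is only an illustration):

```python
from scholarly import scholarly, ProxyGenerator

# search_pubs() scrapes Google Scholar and therefore needs a working proxy.
pg = ProxyGenerator()
pg.FreeProxies()            # one of several proxy options Scholarly supports
scholarly.use_proxy(pg)

for pub in scholarly.search_pubs('example search query'):
    print(pub['bib']['title'])
    break
```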
One effective approach to circumvent this issue is to conduct literature searches in a semi-automatic manner. Selenium, a browser automation framework primarily used for browser testing, can be utilized for this purpose. By leveraging Selenium, manual actions such as clicking and submitting can be automated through programming. Additionally, when encountering CAPTCHAs, users can manually solve them, while the program handles the extraction of information from the webpages. This approach allows the program to assist humans in a collaborative manner, significantly speeding up the process compared to manual searching and eliminating the need for proxy settings. However, it's important to note that this approach is not fully automated and requires human attention and involvement.
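The human-in-the-loop idea boils down to a few lines; a rough sketch, where is_bot_test() is the CAPTCHA check discussed later in this article:

```python
# Selenium drives the browser; the human only steps in when a CAPTCHA appears.
if is_bot_test(browser):
    input("CAPTCHA detected - please solve it in the browser window, then press Enter ")
# ...the program then resumes extracting information from the page
```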
This article presents an innovative semi-automatic program called GSC-ABE, which aims to streamline the literature search process. The algorithm, key components, and various challenges associated with GSC-ABE will be discussed in the following sections.
Basic pipeline
GSC-ABE serves as an integral component within a larger database building work pipeline. To provide context, let's outline the general workflow of this pipeline.
Initially, I employ a literature search engine such as Google Scholar or the search engine provided on the official website of relevant journals to gather a set of initial articles, which I refer to as "article seeds." This selection is primarily based on my existing knowledge of the subject matter. Typically, this yields tens of papers that I proceed to read in order to gain a preliminary understanding of the field. However, at this stage, I lack a comprehensive overview of the field, making it challenging to discern the most significant papers or authors. Consequently, identifying the truly crucial problems within this field proves difficult.
After gathering the "article seeds," the next step is exporting them as a CSV file using Zotero. This CSV file is then fed into the program GSC-ABE. GSC-ABE operates by reading each item within the file line by line, extracting the corresponding article titles, and conducting a Google search for each article. By doing so, GSC-ABE identifies papers that cite the aforementioned articles on the Google Scholar platform. Specifically, papers with a moderate number of citations, such as 15, are recorded and ultimately exported to a separate file.
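A minimal sketch of this step, assuming Zotero's default "Title" column and a placeholder search_google_scholar() helper standing in for the Selenium-driven lookup (the 15-citation figure is read here as a minimum threshold):

```python
import csv

# Read the "article seeds" exported from Zotero (its CSV export has a "Title" column).
with open('article_seeds.csv', newline='', encoding='utf-8') as f:
    titles = [row['Title'] for row in csv.DictReader(f)]

citing_papers = []
for title in titles:
    # search_google_scholar() is a hypothetical helper; it is assumed to yield
    # dicts describing the papers that cite `title`, including a citation count.
    for paper in search_google_scholar(title):
        if paper['citations'] >= 15:    # "moderate number of citations"
            citing_papers.append(paper)
```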
The resulting file from the previous step is imported into Zotero, where a new folder is created to store these articles. Within Zotero, further processing of these article items takes place. Firstly, each article item is assigned its corresponding DOI. Using the DOI, information such as the citation count and availability of the SciHub PDF version is retrieved. Following this post-processing stage, a new CSV file can be exported for subsequent analysis. This analysis may involve determining the number of appearances by each author, examining the publication year distribution, and other relevant factors.
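For the analysis step, something along these lines would work, assuming the exported CSV keeps Zotero's default "Author" and "Publication Year" columns:

```python
import pandas as pd

df = pd.read_csv('cited_papers.csv')   # the post-processed export from Zotero

# Zotero separates multiple authors with "; ", so split before counting appearances.
author_counts = df['Author'].str.split('; ').explode().value_counts()
year_distribution = df['Publication Year'].value_counts().sort_index()

print(author_counts.head(10))   # most frequently appearing authors
print(year_distribution)        # publication year distribution
```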
Key parts
HTML elements
The crucial HTML elements are defined as XPath constants near the top of the script, for example:
```python
GOOGLE_HOMEPAGE_SEARCH_BOX_TAG = '//*[@name="q"]'
```
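Such constants are then passed to Selenium's locator calls, roughly like this (the title variable is only an illustration):

```python
from selenium.webdriver.common.by import By

# Locate Google's search box via the XPath constant, type the article title and submit.
search_box = browser.find_element(By.XPATH, GOOGLE_HOMEPAGE_SEARCH_BOX_TAG)
search_box.send_keys(title)
search_box.submit()
```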
Wait function
These days, most web apps use AJAX techniques. When a page is loaded by the browser, the elements within that page may load at different time intervals. This makes locating elements difficult: if an element is not yet present in the DOM, a locate function will raise a NoSuchElementException. Using waits, we can solve this issue. Waiting provides some slack between actions performed - mostly locating an element or any other operation with the element.
Selenium Webdriver provides two types of waits - implicit & explicit. An explicit wait makes WebDriver wait for a certain condition to occur before proceeding further with execution. An implicit wait makes WebDriver poll the DOM for a certain amount of time when trying to locate an element.
In GSC-ABE, we use a combination of explicit and implicit waits. Two examples are `wait.until(EC.presence_of_element_located((By.NAME, 'q')))` and `wait.until(EC.element_to_be_clickable((By.XPATH, GOOGLE_SEARCH_RESULT_CITI_FRMAE)))`, which wait for an element to be present and for an element to become clickable, respectively. `browser.implicitly_wait(0.5)` is used when waiting for CAPTCHAs.
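For completeness, a minimal sketch of how these waits are typically set up with Selenium (the 10-second timeout is an assumption, not a value taken from GSC-ABE):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)   # explicit wait: block until the condition holds (or time out)
browser.implicitly_wait(0.5)        # implicit wait: short DOM polling, used for the CAPTCHA check

browser.get('https://www.google.com')
search_box = wait.until(EC.presence_of_element_located((By.NAME, 'q')))
```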
CAPTCHA test
```python
def is_bot_test(browser):
    ...
```
A CAPTCHA test is employed whenever the browser attempts to open a new page or click on an element that triggers the opening of a frame. The determination within the `if` statement is achieved through a trial-and-error approach. Remarkably, this method proves to be highly effective, even with a relatively short wait time set at only 0.5 seconds.
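The body of is_bot_test() is not reproduced here. As a rough illustration only, a check along these lines would fit the description above; the exact condition used in GSC-ABE was found by trial and error and may differ:

```python
from selenium.webdriver.common.by import By

def is_bot_test(browser):
    """Guess whether Google is currently showing a CAPTCHA / bot-check page.

    Hypothetical check: look for Google's /sorry/ interstitial or a reCAPTCHA
    iframe, relying on the short 0.5 s implicit wait configured earlier.
    """
    if 'google.com/sorry' in browser.current_url:
        return True
    # find_elements returns an empty list (no exception) when nothing matches
    return bool(browser.find_elements(By.XPATH, '//iframe[contains(@src, "recaptcha")]'))
```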
It is worth mentioning that even if you successfully pass the CAPTCHA test each time, Google may still block your IP address. To overcome this issue, a strategy that proves to be simple yet effective is to restart the browser. By restarting the browser, you can evade the IP blocking imposed by Google.
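In code, the restart strategy is as simple as it sounds; a sketch, assuming a Chrome driver whose fresh session starts without the old cookies:

```python
def restart_browser(browser):
    """Throw away the (possibly blocked) session and start a fresh one."""
    browser.quit()
    new_browser = webdriver.Chrome()
    new_browser.implicitly_wait(0.5)
    return new_browser
```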
Some challenges
However, there are a few challenges that should be noted:
Despite the overall functionality of the program, there may still be some errors that have not been addressed. While these errors may require further refinement, they do not significantly impact the program's daily usage.
In the event of an error, the program will skip the current article item and save it into a separate list of failed items. Consequently, running the program twice may be necessary to obtain a complete and accurate result.
Utilizing proxies could transform the program into a fully automated tool, removing the remaining need for human involvement in the search process.
Conclusions
The decision to replace the proxy-based automatic searcher with a proxy-free, Selenium-based semi-automatic searcher has proven to be a wise choice. With proper handling of the wait function and CAPTCHA tests, the tool operates smoothly. It allows new researchers to quickly gain a fundamental understanding of a research field and build a database for future investigations. Looking ahead, a proxy-based, fully automatic version of the tool is planned.