ThankGod Ukachukwu
4 min read · Mar 6, 2020

Hack for Web Crawling and HTML Parsing in React Native Application

NB: This article will be updated to highlight the use of HTML parsers alongside the unorthodox approach.

Sometimes in software development we run into problems that, over time, have acquired conventional solutions, and the professionals always warn: do not do this! One such convention is “Don’t parse HTML with regex!” That was the answer I got to my question on Stack Overflow, which was eventually closed by people who did not pause to understand the question but bully others because “this is the way it is done”. Anything outside the orthodox is unacceptable.

I encountered a problem while developing a global shopping app using React Native (RN). I needed to transfer the shopping cart from shopping websites such as amazon.com and eBay into the app. The WebView component in RN has a mechanism through which I can inject JavaScript code into the loaded page and locate what is of interest to me, in this case the product name and price. A snippet of what the WebView looks like is below.

Webview in a React Component
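As a rough sketch of the wiring, the props such a WebView needs can be collected in a plain function, written this way only so it can be run outside React Native. The helper name `buildWebViewProps` and the `onHtml` callback are hypothetical; `source`, `javaScriptEnabled`, `injectedJavaScript` and `onMessage` are props of the WebView component, and the injected string is the one used later in this article.

```javascript
// Hypothetical helper: collects the props for a <WebView> that loads a
// product page and reports its HTML back to the app.
function buildWebViewProps(url, onHtml) {
  return {
    source: { uri: url },
    javaScriptEnabled: true,
    // Runs inside the loaded page and posts the page HTML back to the app.
    injectedJavaScript: 'window.postMessage(document.documentElement.innerHTML)',
    // Receives whatever the injected script posted.
    onMessage: (event) => onHtml(event.nativeEvent.data),
  };
}

// In a component's render method this might be used as:
//   <WebView {...buildWebViewProps(productUrl, (html) => this.handleHtml(html))} />
```

Here `productUrl` and `handleHtml` are placeholders for whatever the component already has.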

By default this problem is solved by web crawling, and a stroll through Google brings up Python-based and Node.js-based solutions. But this is a mobile app written in React Native, and Python and Node.js run on the backend. So I would need to capture the URL and send it to my backend API, which would process the page and return the needed strings. That means adding another layer to the application. The existing backend is written in PHP (web crawling solutions in PHP are not as plentiful as in Python and Node.js), I want to avoid additional API calls, and I think I can solve this problem inside the app without any.

So the first in-app solution that comes to mind, and the one the people at Stack Overflow recommended, is an HTML parser. This means loading an HTML parser and then searching for the strings I need using something like document.getElementsByClassName("product-price") or getElementById. So I proceeded to check out HTML parsers. Of the few I tried, some either returned nothing or hung the application when called, like react-native-html-parser. However, react-native-cheerio worked. But the HTML structure of each website is completely different from the others, so the element that displays the product name comes in a different form on each site.

Consequently, I needed an unorthodox solution. The guys at Stack Overflow cannot deter me from making the application work. They are not my client. My client doesn’t know what the code looks like; what the client cares about is that the application works efficiently. If this were team work, I would defend my implementation, maybe another team member would come up with another pull request, and the team leader would determine which one gets merged. I would learn more. However, this is lone ranger stuff.

So in the WebView component, I retrieve the entire web page using injected JavaScript code:

const jsCode = "window.postMessage(document.documentElement.innerHTML)";
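One caveat, as far as I know: in newer versions of the react-native-webview package the bridge back to the app is window.ReactNativeWebView.postMessage rather than window.postMessage, and its documentation suggests ending the injected script with `true;`. A minimal sketch that covers both cases might look like this; the helper name `buildInjectedScript` is my own.

```javascript
// Builds the string passed to the WebView's injectedJavaScript prop.
function buildInjectedScript() {
  return [
    '(function () {',
    '  var html = document.documentElement.innerHTML;',
    '  // Use the react-native-webview bridge when the page has it.',
    '  if (window.ReactNativeWebView) {',
    '    window.ReactNativeWebView.postMessage(html);',
    '  } else {',
    '    // Fall back to the call used in this article.',
    '    window.postMessage(html);',
    '  }',
    '})();',
    'true;',
  ].join('\n');
}

const jsCode = buildInjectedScript();
```

The resulting `jsCode` string is passed to the WebView exactly as before.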

My first approach, with the HTML parser, was to check whether any of the possible class names for the product name, such as those below, is present on the product page. If one is present, I capture the content using Cheerio and, after some clean-up, set the captured product name.

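// Candidate class names for the product name
const productNameClasses = [
  'product-title', 'productTitle', 'producttitle',
  'product-name', 'productName', 'productname',
  'item-name', 'itemName', 'itemname',
  'itemTitle', 'itemtitle',
  'BOLD', 'item-title', 'cName', 'productDescription',
];
const $ = cheerio.load(wishData);
// Shown here for one class name; the same lookup applies to each candidate
const iList = $('.product-name').html();
// nutsArray holds the cleaned-up pieces of iList (clean-up step not shown)
this.setState({ productNameJscode: nutsArray[0] });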
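The "try each class name in turn" step can be sketched as a small loop. This is only an illustration: `findFirstByClass` is a hypothetical helper, `$` is assumed to be the function returned by cheerio.load, and it reads the element text with `.text()` instead of the `.html()` call used above.

```javascript
// Candidate class names, taken from the list above.
const PRODUCT_NAME_CLASSES = [
  'product-title', 'productTitle', 'producttitle',
  'product-name', 'productName', 'productname',
  'item-name', 'itemName', 'itemname',
  'itemTitle', 'itemtitle',
  'BOLD', 'item-title', 'cName', 'productDescription',
];

// Hypothetical helper: returns the trimmed text of the first element that
// matches any candidate class, or null when nothing matches.
function findFirstByClass($, classNames) {
  for (const className of classNames) {
    const matches = $('.' + className);
    if (matches.length > 0) {
      const text = matches.first().text().trim();
      if (text !== '') return text;
    }
  }
  return null;
}

// Usage in the app (assuming cheerio is imported):
//   const $ = cheerio.load(wishData);
//   const name = findFirstByClass($, PRODUCT_NAME_CLASSES);
```

A null result is the signal to fall back to another strategy.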

The second approach is for situations where the first approach didn’t work: I search the HTML string for the index of product-price, productprice, productname, or similar class names from the list above that might be used in the CSS of various shopping carts. The WebView provides an onMessage prop, which receives the result of the script passed in injectedJavaScript={jsCode}. In the callback, I call this._getProductName(wishData) and this._getProductPrice(wishData) to locate the index of the class:

onMessage={event => {
  // event.nativeEvent.data holds the HTML page posted by the injected script
  const wishData = event.nativeEvent.data;
  this._getProductName(wishData);
  this._getProductPrice(wishData);
}}

After finding the class name, I took a few hundred characters (about 700) following it, used a regex to extract the string I was looking for, did some clean-up, and called setState. When the user goes to the product summary, the string is passed as props to the component that displays the information.

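// newText: the ~700 characters that follow the matched class name
var prodTT = newText.match(/>[ \n]*?(.+)[ \n]*?</);
if (prodTT !== null) {
  // Strip the enclosing '>' and '<', then trim surrounding whitespace
  var result = prodTT[0].replace('>', '').replace('<', '').trim();
  console.log("product:" + result);

  this.setState({ productNameJscode: result });
}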
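Putting the second approach together, a self-contained sketch could look like the following. The function name, the default window of 700 characters and the slightly stricter regex are my own choices for illustration, not the exact code from the app.

```javascript
// Hypothetical helper: looks for each candidate class name in the raw HTML,
// takes a window of characters after it, and returns the first non-empty
// text found between a '>' and the next '<'. Returns null if nothing is found.
function extractTextAfterClass(html, classNames, windowSize = 700) {
  for (const className of classNames) {
    const start = html.indexOf(className);
    if (start === -1) continue; // class name not on this page
    const snippet = html.slice(start, start + windowSize);
    // The captured text must begin with a non-whitespace character,
    // so gaps between nested tags are skipped.
    const match = snippet.match(/>\s*([^<>\s][^<>]*?)\s*</);
    if (match !== null) return match[1];
  }
  return null;
}

const sampleHtml =
  '<div><h1 class="product-title">\n  Blue Kettle \n</h1>' +
  '<span class="product-price">$19.99</span></div>';

console.log(extractTextAfterClass(sampleHtml, ['productTitle', 'product-title']));
console.log(extractTextAfterClass(sampleHtml, ['product-price']));
```

One limit to keep in mind: indexOf finds the first occurrence of the class name anywhere in the page, so a match inside a stylesheet or script would point the window at the wrong place.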

Example of the result is shown below.
