Using Perl and Regular Expressions to Process HTML Files – Part 2

In this article we will look at how to change the contents of an HTML file by running a Perl script on it.

The file we are going to process is called file1.htm:

Note: To ensure that the code is displayed correctly, square brackets ‘[..]’ are used in HTML tags in the example code in this article instead of angle brackets ‘<..>’.

[html]
[head][title]Sample HTML File[/title]
[link rel="stylesheet" type="text/css" href="style.css"]
[/head]
[body]
[h1]Introduction[/h1]
[p]Welcome to the world of Perl and regular expressions[/p]
[h2]Programming Languages[/h2]
[table border="1" width="400"]
[tr][th colspan="2"]Programming Languages[/th][/tr]
[tr][td]Language[/td][td]Typical use[/td][/tr]
[tr][td]JavaScript[/td][td]Client-side scripts[/td][/tr]
[tr][td]Perl[/td][td]Processing HTML files[/td][/tr]
[tr][td]PHP[/td][td]Server-side scripts[/td][/tr]
[/table]
[h1]Summary[/h1]
[p]JavaScript, Perl, and PHP are all interpreted programming languages.[/p]
[/body]
[/html]

Imagine that we need to change both occurrences of [h1]heading[/h1] to [h1 class="major"]heading[/h1]. Not a major change, and something that could easily be done manually or with a simple search and replace. But we are just getting started here.

To do this, we could use the following Perl script (script1.pl):

1 open (IN, "file1.htm");
2 open (OUT, ">new_file1.htm");
3 while ($line = [IN]) {
4 $line =~ s/[h1]/[h1 class="major"]/;
5 print OUT $line;
6 }
7 close (IN);
8 close (OUT);

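Note: You do not need to enter the line numbers. I have included them merely so that I can refer to individual lines in the script. As with the HTML tags, the square brackets in lines 3 and 4 stand for angle brackets, so in the real script line 3 reads from the filehandle with <IN>.

Let’s look at each line of the script.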

Line 1
In this line file1.htm is opened so that it can be processed by the script. In order to process the file, Perl uses something called a filehandle, which provides a kind of link between the script and the operating system and holds information about the file that is being processed. I have named this input filehandle ‘IN’, but I could have used anything within reason. Filehandles are usually written in capitals.

Line 2
This line creates a new file called ‘new_file1.htm’, which is written to by using another filehandle, OUT. The ‘>’ just before the filename indicates that the file will be written to.
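
The script above does not check whether either file was actually opened. As a minimal sketch of how that check could look, the snippet below uses the three-argument form of open with lexical filehandles ($in, $out) instead of the IN and OUT barewords. Those choices, the helper name open_pair, and the temporary folder used for the demonstration are my own assumptions for illustration, not part of the original script.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper: open one file for reading and another for writing,
# stopping with a clear message if either open fails.
sub open_pair {
    my ($in_name, $out_name) = @_;
    open(my $in,  '<', $in_name)  or die "Cannot read $in_name: $!";
    open(my $out, '>', $out_name) or die "Cannot write $out_name: $!";
    return ($in, $out);
}

# Demonstration in a throwaway folder so that no real files are touched.
my $dir      = tempdir(CLEANUP => 1);
my $in_name  = File::Spec->catfile($dir, 'file1.htm');
my $out_name = File::Spec->catfile($dir, 'new_file1.htm');

# Create a small input file to open.
open(my $seed, '>', $in_name) or die "Cannot create $in_name: $!";
print $seed "<h1>Introduction</h1>\n";
close($seed);

# Open both files, copy the first line across, and close the handles.
my ($in, $out) = open_pair($in_name, $out_name);
my $first_line = <$in>;
print $out $first_line;
close($in);
close($out);
```

If file1.htm were missing, the first open would fail and the script would stop with a message naming the file, rather than quietly producing an empty new_file1.htm.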

Line 3
This line sets up a loop in which each line in file1.htm will be examined individually.

Line 4
This is the regular expression. It searches for one occurrence of [h1] on each line of file1.htm and, if it finds it, changes it to [h1 class="major"].

Looking at Line 4 in more detail:

    • $line – This is a variable that contains a line of text. It gets modified if the substitution is successful.
    • =~ is called the binding operator. It applies the substitution to $line.
    • s is the substitution operator.
    • [h1] is what needs to be substituted (replaced).
    • [h1 class="major"] is what [h1] has to be changed to.
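
To see the substitution on its own, here is a small sketch that applies it to a few strings rather than to a file. It is written with real angle brackets, because typed literally, [h1] would be read by Perl as a character class matching a single h or 1. The variable names and sample strings are assumptions chosen for the illustration.

```perl
use strict;
use warnings;

# One line of the sample file, written with real angle brackets.
my $line = "<h1>Introduction</h1>\n";

# s/PATTERN/REPLACEMENT/ bound to $line with =~ edits $line in place
# and returns the number of substitutions made.
my $changed = ($line =~ s/<h1>/<h1 class="major">/);

# Without the /g modifier, only the first match on a line is replaced.
my $two = "<h1>A</h1><h1>B</h1>";
$two =~ s/<h1>/<h1 class="major">/;

# A line with no <h1> tag is left untouched and the result is false.
my $plain     = "<p>Welcome</p>\n";
my $untouched = ($plain =~ s/<h1>/<h1 class="major">/);

print $line;
print "$two\n";
```

In file1.htm each [h1] tag sits on its own line, so replacing one occurrence per line is enough. A line holding several headings would need the g modifier at the end of the expression.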

Line 5
This line takes the contents of the $line variable and, via the OUT filehandle, writes the line to new_file1.htm.

Line 6
This line closes the ‘while’ loop. The loop is repeated until all the lines in file1.htm have been examined.

Lines 7 and 8
These two lines close the two filehandles that have been used in the script. If you left out these two lines the script would still work, but it is good programming practice to close filehandles, thus freeing up the filehandle names so they can be used, for example, by another file.
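
Putting the eight lines together, the following is one possible runnable version of the script, typed with real angle brackets. It is a sketch rather than the script exactly as listed: the sub name add_h1_class, the lexical filehandles, the error checks, and the count of changed lines are all my own additions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in_name to $out_name, adding class="major"
# to the first <h1> tag found on each line. Returns the number of lines changed.
sub add_h1_class {
    my ($in_name, $out_name) = @_;
    open(my $in,  '<', $in_name)  or die "Cannot read $in_name: $!";
    open(my $out, '>', $out_name) or die "Cannot write $out_name: $!";
    my $count = 0;
    while (my $line = <$in>) {
        # Count the line only if the substitution actually happened.
        $count++ if $line =~ s/<h1>/<h1 class="major">/;
        print $out $line;
    }
    close($in);
    close($out) or die "Cannot close $out_name: $!";
    return $count;
}

# Run only when file1.htm is present in the current folder.
if (-e 'file1.htm') {
    my $n = add_h1_class('file1.htm', 'new_file1.htm');
    print "Changed $n line(s)\n";
}
```

Wrapping the loop in a sub keeps the filenames out of the logic, so the same code could be pointed at other files later.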

Running the Script

As the purpose of this article is to explain how to use regular expressions to process HTML files, and not necessarily how to use Perl, I do not want to spend too long describing how to run Perl scripts. Suffice to say that you can run them in several ways, for example, from within a text editor such as TextPad, by double-clicking the Perl script (script1.pl), or by running the script from an MS-DOS window.

(The location of the Perl interpreter will need to be in your PATH statement so that you can run Perl scripts from any location on your computer and not just from within the directory where the interpreter (perl.exe) itself is installed.)

So, to run our script we could open an MS-DOS window and navigate to the location where the script and the HTML file are located. To keep life simple I have assumed that these two files are in the same folder (or directory). The command to run the script is:

C:\>perl script1.pl

If the script does work (and hopefully it will), a new file (new_file1.htm) is created in the same folder as file1.htm. If you open the file you will see that the two lines that contained [h1] tags have been changed so that they now read [h1 class="major"].

In Part 3 we’ll look at how to handle multiple files.