For the first half of this article the main topic is multithreading in the 2 scripting languages, Perl and Ruby. By writing a multithreaded download manager application in Perl and then porting it to Ruby, it'll show you how to write a multithread application in the both languages and show you the difference of these 2 languages in this area. This section should be fairly easy and doesn't require much knowledge about the scripting languages, but it's expected that you have basic grasp of multithread programming.
The second half is for a bit more advanced programming topic; it's about how to write a C++ application with an embedded Perl or Ruby interpreter. Simply embedding them is not rocket science, but using them in an effective manner is not a very easy task right now because of the implementations of these languages. If you are familiar with .NET you might know it's embed-friendly with AppDomain and COM interfaces. On the other hand you have only raw C interfaces for these scripting languages, let alone scarce documents. As for Perl embedding, the sample code is based on the version I actually implemented in the web server of the DICE. Since it's realized by the mixture of C++ code and Perl hack, it requires some knowledge of C/C++, advanced Perl programming, and Perl internals. But don't be scared, I'll annotate most lines in the code to make it useful for as many people as possible because it's the very purpose of this article! Last but not least, the platform for those experiments is Microsoft Windows XP and Visual C++ 7.1. But due to the platform-neutral nature of these scripting languages most things should be applicable to any platforms.
Though Microsoft Windows is my primary desktop, I still use the Perl scripting language to make a code snippet for a trivial text manipulation task such as list generation or some network test. I know the Ruby object-oriented scripting language has become one of the hot topics among tech-savvy people, but I'd prefered Perl just because I was familiar with it for longer time through developing Perl CGI applications in the earlier part of my programming history. My most preferred programming language is by far C++, but it's not exactly handy to write a tiny application. I like C# and .NET too, but it's still overkill. Dynamic languages have certain drawbacks and I would like to criticise them, but it's true dynamic languages are sweet.
There are many scripting languages, Perl, Ruby, Python, PHP... First Python is out, I don't like significant whitespaces because I add lots of indents for an aesthetic reason only. Next PHP is dropped, its focus seems to be the companionship with Apache and nothing else. Ruby is cool, but I didn't like the begin...end block. Braces can be used as blocks, but not in other places. Basically I'm too fond of the C-style (or Algol-style) syntax. Perl 5 is a clumsy procedural scripting language built on the C-like syntax, and the only point I applaud in Perl is that it recommends the C-like syntax (albeit with a little difference here and there) if you try to be consistent. If I could use Ruby with C/Java like syntax I'd completely abandon Perl anyday. But it didn't happen.
Besides, object-orientedness in Ruby doesn't matter in a small code snippet, since humans are not that dumb and can use non-object-oriented, non-intuitive expressions. Well it's not totally useless, but it makes more sense in the other situations. When you are building a huge software stack or a huge loosely-coupled software network, capsulation is very important. There are other goodies in Ruby, but I couldn't care less about esoteric syntax sugars. The only one left is Perl. Is it the perfect language for writing a code snippet? It has to pass an actual test to see how perfect it is anyway.
One day I noticed my favorite download manager application for Windows didn't support chunked HTTP transfer properly. The downloader / web site grabber application, Irvine, is an excellent Japanese software but the development halted a few years ago. Its document said it would support it in future, but it's seemed that it wouldn't arrive for a while. If I have enough time I want to create an HTTP download manager just for myself in C# or something, but it's not possible at that time due to time constraints. The functionality I needed was to download many files off a web site that is configured to send all files in chunked transfer. Then I began evaluating a network library of Perl to see whether it supports chunked transfer. If it does I can just write a short Perl script. But it doesn't end there, as I had to download hundreds of files it's preferable that it downloads multiple files simultaneously just like the aforementioned Windows application. I knew Perl had implemented multithreading years ago, but still had a vague impression that it might be awkward. So this is a nice occasion to make an experiment on these 2 points: HTTP networking and multithreading.
Thinking about them, it came to my mind that I'd read somewhere that Ruby supported them out of the box. Ruby supports multithreading even in DOS by its own non-native threads. Also Ruby's library is supposed to be able to handle networking with ease as far as I know. I'd known how Ruby works and what kind of things are available for it for years but didn't write a Ruby code just because Perl was sufficient for my use. On the other hand the transition from Perl 5 to Perl 6 seemed not exactly smooth and I'd waited years for it to come but eventually the whole Perl scene got out of my interest when I was into other things. So this is a good occasion to evaluate Ruby for myself and see how my prejudice against it can stand.
So let's write a multithreaded download manager for Perl. First, we have to choose how to download files. The goal is to get an application that can handle chunked transfer properly and just that. Before writing a complicated network negotiation with basic Perl Socket classes, it's safe to search what can be done with preexisting tools. libwww-perl (LWP) is a set of Perl modules which provides API to write a web client. Fortunately it could download a file from the web site from which the Windows downloader application failed to download a file. If the standard Perl distribution for Windows available from ActiveState doesn't have LWP, you have to download it from the author's page at the CPAN and build it with the nmake of Visual C++ (probably freely available from Microsoft) in a command console. libwww-perl gives you a web client object to surf the web.
Another thing that needs clarification is how Perl's multithreading works. Perldoc has a tutorial for Perl multithreading, perlthrtut. This document carries most of the info you need to program a Perl multithread application. If there's only one point specific to the Perl thread, it's that memory space is not shared by threads unlike usual thread implementations. Most of primitives employed in multithreaded programming are available in there. For more info about thread-related modules in Perl such as threads::shared, see their documents from the author's page at CPAN.
By the way I first tried to make my downloader application in this way - it spawns multiple worker threads from the main thread and suspends the main thread by putting a blocking semaphore (by setting the counter value) in it, then a worker thread downloads a file into the download queue, and it spawns another thread at the end of a thread. When the queue becomes empty the semaphore in the main thread is lifted, then all threads are joined and the main thread ends. The total number of worker threads is constant throughout execution. But it didn't work as expected, for some unknown reason the Perl interpreter crashes at the end of execution everytime I run it though it can download all the files in the list. If what you are writing is a native multithread application for Microsoft Windows this should work, but in Perl's case you have to join all threads without spawning a new thread from a running thread. I first thought it's caused by a bug in the thread module and browsed the forum for the module. What I learned there was some modules are not thread-safe and can't be used in Perl threads. libwww-perl looks like one of them and if I removed it the program could end without an error. Seeking the solution I downloaded the latest version of the threads module from the author's page and built it. I tested threads-1.42 which was the latest version. It had a small problem when compiled with VC++, though. It couldn't detect the existence of a C compiler even though I used the command console that came with the Microsoft Platform SDK. The solution is to edit Makefile.pl and skip the have_cc subroutine. But it crashed too. Without a choice I abandoned this initial design and decided to use a more traditional, pthread-esque design.
It spawns worker threads and they get in infinite loops that pick up a job from a job queue one by one. The main thread blocks by calling the
After user configuration options, necessary modules are loaded along with other tools useful throughout this application.
The
The next part is the main loop of the script.
It's very short. It's cooler that there are less things to tweak than to be error-prone. It reads the content of the URL list and puts all of them in the download job queue. Then spawns worker threads, and waits them by calling
A spawned worker thread calls the function referenced as
It instantiates an object of the
That's all for the Perl downloader script. It's about 200 lines and it works, no big deal. So let's move onto the next task, porting it to Ruby.
Before writing the code, since Japanese is my 1st language I went to the Japanese side of the Ruby HQ, ruby-lang.org for the Ruby language/library reference. It's known that Ruby was invented in Japan and some non-Japanese people often complain help documents of Ruby are rather weak compared to Perl or Python in a language comparison war. But the reality is, the Japanese documents for Ruby are downright terrible. At least the official manual is very sparse and unorganized. Probably mailing lists and code samples in Japanese may carry beefier contents than English equivalents, but the Japanese documents at ruby-lang.org are not something you yearn for and I actually found English documents linked from the English side are more useful.
Writing a Ruby code immediately after writing a Perl code is a bit puzzling experience. In Ruby, the
Let's take a look at the Ruby version. It has the user configuration part at the beginning and it won't need much explanation as it's almost identical to the Perl version. Some variables are declared as instance variables of the main function to make them available in methods. In the Ruby version, HTTP negotiation is handled by the standard
The following section contains the Ruby version of the functions with the same name as in the Perl version. The only difference is it uses synchronization primitives (mutex) more intuitively.
The next section is again the function with the same name as the Perl version, but its content is a little different.
By processing a code block the Net::HTTP class can process an incoming HTTP response-body stream in multiple small chunks as the
The rest covers the main function. I think you can see the pattern here and it needs no further explanation.
That's all for the Ruby version.
Now how do they compare? The first thing to be noticed is the script length. The Perl version is over 200 lines and 4,890 bytes. The Ruby version is under 150 lines and 3,383 bytes. The second point is the performance. Squeezing extra performance is the only reason why they are multithreaded. I used them to download dozens of files and the Perl version took up about 15MB RAM and the Ruby version occupied only 5MB. This is probably due to the difference of downloading methods between 2 network libraries (whole download vs partial), but also due to the implementation of interpreter threads in these languages, since 15MB RAM usage is a bit too high for downloading files around 100KB. The speed seemed a tie, but in some cases the Perl version was faster and the Ruby version stuttered. Probably it can be attributed to file I/O and some context-switching awkwardness in the Ruby version.
My final verdict on this subject is, Ruby is the winner, contrary to some of the possible doubts I suggested above. Of course I still don't like
But this discourse doesn't end there. Ruby is certainly a new kid on the block (well not very new actually but leave it at that now) that presents great opportunities. Even then, there may be something useful in an older and proven tool. My server application, DICE, ended up with an embedded Perl interpreter instead of Ruby. DICE has already had a Common Language Runtime (CLR) embedded into it since years ago, but for the upcoming new version that took over a year in preparation I planned to add something new for another bullet point for the update list. Though at first I wanted to embed PHP which is fairly popular for a web application because of the affinity with the Apache HTTPd, I abandoned that idea for now. The information about PHP embedding seemed scarce on the web. (BTW I downloaded the source code of the PHP5 and browsed it, it's not much larger than the source code of the DICE in size which was a surprise for me.) So the candidates are, as you expect, Ruby and Perl. What I'd like to accomplish is embedding an interpreter in the DICE and making it process a web application, but not executing it for every request ala CGI. However, the merit of including such an interpreter is that it can support existing applications available without writing a new code. The DICE has a CLR embedded in it to execute a web application, but it requires a user to write a stub code, or an entire application. This time I want existing CGI applications to run on the DICE. Therefore the goal is to add mod_perl or mod_ruby like capability to it. The requirements are
1. The interpreter is persistent.
2. All input from the standard input and all output to the standard output must be hooked by the DICE because stdin/stdout are the points where CGI scripts interact with an internet user through a web server.
This section begins with Ruby. Before writing a code for embedding Ruby, the license of Ruby has to be examined. It adopts the dual license by the GPL and the other BSD-like license. It seems you can do whatever with Ruby if you don't choose GPL, but in reality it's not that simple. The current Ruby implementation (1.8) contains the GNU regular expression library and it forces you to make your application GPLed if you just embed a Ruby interpreter as is. Actually there's a workaround and it's very simple. The current development version of Ruby (1.9) has the Oniguruma regular expressions library which is under the BSD license to avoid this license issue. If you choose to stick to 1.8, the source code of the Oniguruma version 2.x is available for download, it can be integrated to the Ruby 1.8 source code (the instruction is included in the Onigruma document). After building the whole Ruby, you can open the binary (msvcr71-ruby19.dll if you use VC++ 7.1 and Ruby 1.9) with Dependency Walker to see if it has Onigruma symbols (prefixed with Onig) and not GNU regex ones.
The first thing you have to do is to download the Ruby source code at the download page of the ruby-lang.org. Then obviously you have to learn how to embed Ruby. The particular document I referred to when I wrote my sample code was magazine articles written by Shugo Maeda, the author of mod_ruby. They are here and here at Mr. Maeda's web site, and his VIM patch is here. They show how he embedded Ruby into the vi-clone editor VIM. Unfortunately they are in Japanese though some code samples are also available with them. And this is the real horror story, I couldn't find other detailed Japanese documents that are helpful in embedding Ruby. Actually the English EmbedRuby article at the Ruby Garden is probably the second best source of info on this matter. After reading the tutorial, you can download the C source code of mod_ruby to learn how a real-world application hosts a Ruby interpreter though it may be hard for those who are not familiar with how an Apache module works.
In addition to the basics of embedding Ruby, Maeda's article offers how to interrupt the standard output of Ruby and trap it to feed to VIM instead of the console screen. Ruby's outputs are all assigned in the
To build a C++ code with a Ruby interpreter embedded, Ruby header files, a Ruby static library and dynamic link library are required. The header file,
Once the Ruby library is ready you have to set up a build environment for a Ruby-embedded application. Set the root directory of the Ruby source code in your include path and set the library path in your LIB directory (In VC++ it's in Tool | Options | Projects | VC++ Directories | Library files). As the resulted executable requires
The environment is ready, let's see the actual C++ code. This sample code is written for a Ruby 1.9 development snapshot and doesn't work with Ruby 1.8 as there are significant changes in the Ruby source code in 1.9. What this code does is fairly simple, initializes a Ruby interpreter, loads and executes a Ruby script, then destroys the interpreter. It includes necessary files at the beginning. For the sake of simpleness I included the library file by the pragma, but it may be safe to do it in the project settings if your project contains many files that may conflict with it. (In the case of Perl embedding, it didn't work.) In addition to the
First, output hook functions are defined.
The last part is the function that loads and executes a Ruby script (
The
When this program is executed it loads the "test.rb" Ruby script. My "test.rb" was a simple script that does print some strings and apparently it could output characters through the redefined hook functions. I thought that my experiment was a success. But I did one more test. The new test fed the Ruby downloader script which I described above to this test code. To my surprise, it halted with the "memory error" in the Ruby interpreter. I had no idea what's wrong, but apparently the culprit was the
The last hope is naturally on Perl. First you need the Perl source code. I used a Perl development snapshot available at here but snapshots of the development trunk in this directory often fail to build or pass the test. After looking at the latest version number of the stable release at Perl.com , you may pick up the latest stable snapshot in another directory. You can find exactly which snapshot is broken at the perl.perl5.porters newsgroup.
To build it on Windows, edit the
General information on how to embed a Perl interpreter into your C/C++ application is available in the perlembed document. However, I recommend you to browse perlguts and perlcall before going to perlembed. Also perlapi may be useful for a reference. perlguts explains how the C implementation of Perl represents Perl's data types and subroutines in C. perlcall explains how to call Perl subroutines from C. If you still have questions after reading these documents, you can always grep the Perl source code for functions, function macros and comments.
For an actual embedding code sample, mod_perl is probably the most complete. If you read
1. Setup environmental variables passed to Perl
2. Override Perl's standard input and standard output
3. Process an HTTP request to a Perl CGI
4. Restore everything back
Basically it's in line with what I wrote I wanted to do earlier in this article. Apache does most of these things by directly manipulating Perl by C code. But I don't need as much security as the Apache module does, since the current version of the DICE doesn't support multi-user. Running multiple interpreters is not required for now since the context switching method by
I decided to offload most of the sandbox work to Perl to keep the C++ part simple. The basic design is like this:
1. The Perl interpreter in the DICE loads a setup Perl script which sets
2. Whenever a CGI script is called, if it's the first time, the Perl interpreter evaluates it as a subroutine in a unique package by using
3. The Perl interpreter executes a compiled Perl code by dereferencing its subroutine reference and the C++ part gets the output by reading a Perl variable to collect all Perl output. The C++ part is protected and synchronized by a critical section that keeps this Perl variable alive (though there's certain performance penalty)
Let's take a look at the Perl setup script (
When a Perl script is executed, this class is associated with
The next section defines
The
From here on the main class EmbeddedPerlSandbox is defined.
The next subroutine is necessary to make a user script behave like a Perl CGI. If a persistent interpreter just executes the same compiled code twice, all global variables persist in the second execution. For example,
if this code is executed repeatedly in a persistent interpreter the displayed number continues to increase. This behavior is OK for a Java servlet and other persistent web applications, but not desirable if a Perl CGI is the target of emulation. It can be avoided by clearing global package variables previously defined in the package before executing a user code. This
The next subroutine compiles a user code as the name suggests.
These subroutines are just utility functions.
The
First it changes the current directory to the directory of the user script by calling
The last section is for the hook functions for these dangerous global functions.
The
That's all for the Perl side (
The
The
The next part calls the
It pushs two strings
In this section it actually execute the compiled Perl code.
After
When the work is done it frees the Perl interpreter. It's the end of the C++ sample code. I implemented it in the DICE with additional care for a persistent interpreter.
The winner of this round is Perl. But it's not free from a problem, for example its header file redefines common API names with C macros. It's not designed to host multiple interpreters and plagued with excessive C macro use. I hope this mess is cleaned up in Perl 6 with a new VM (Parrot), just like Ruby's new VM (YARV) plans. If Perl 6 becomes the mainstream this article has to be rewritten since the embedding method should be fairly different from the current one, though I somehow doubt the likelihood that Perl 6 becomes too popular. One thing for sure is I'm satisfied with the relative robustness of Perl, for now.
The second half is for a bit more advanced programming topic; it's about how to write a C++ application with an embedded Perl or Ruby interpreter. Simply embedding them is not rocket science, but using them in an effective manner is not a very easy task right now because of the implementations of these languages. If you are familiar with .NET you might know it's embed-friendly with AppDomain and COM interfaces. On the other hand you have only raw C interfaces for these scripting languages, let alone scarce documents. As for Perl embedding, the sample code is based on the version I actually implemented in the web server of the DICE. Since it's realized by the mixture of C++ code and Perl hack, it requires some knowledge of C/C++, advanced Perl programming, and Perl internals. But don't be scared, I'll annotate most lines in the code to make it useful for as many people as possible because it's the very purpose of this article! Last but not least, the platform for those experiments is Microsoft Windows XP and Visual C++ 7.1. But due to the platform-neutral nature of these scripting languages most things should be applicable to any platforms.
Though Microsoft Windows is my primary desktop, I still use the Perl scripting language to make a code snippet for a trivial text manipulation task such as list generation or some network test. I know the Ruby object-oriented scripting language has become one of the hot topics among tech-savvy people, but I'd prefered Perl just because I was familiar with it for longer time through developing Perl CGI applications in the earlier part of my programming history. My most preferred programming language is by far C++, but it's not exactly handy to write a tiny application. I like C# and .NET too, but it's still overkill. Dynamic languages have certain drawbacks and I would like to criticise them, but it's true dynamic languages are sweet.
There are many scripting languages, Perl, Ruby, Python, PHP... First Python is out, I don't like significant whitespaces because I add lots of indents for an aesthetic reason only. Next PHP is dropped, its focus seems to be the companionship with Apache and nothing else. Ruby is cool, but I didn't like the begin...end block. Braces can be used as blocks, but not in other places. Basically I'm too fond of the C-style (or Algol-style) syntax. Perl 5 is a clumsy procedural scripting language built on the C-like syntax, and the only point I applaud in Perl is that it recommends the C-like syntax (albeit with a little difference here and there) if you try to be consistent. If I could use Ruby with C/Java like syntax I'd completely abandon Perl anyday. But it didn't happen.
Besides, object-orientedness in Ruby doesn't matter in a small code snippet, since humans are not that dumb and can use non-object-oriented, non-intuitive expressions. Well it's not totally useless, but it makes more sense in the other situations. When you are building a huge software stack or a huge loosely-coupled software network, capsulation is very important. There are other goodies in Ruby, but I couldn't care less about esoteric syntax sugars. The only one left is Perl. Is it the perfect language for writing a code snippet? It has to pass an actual test to see how perfect it is anyway.
One day I noticed my favorite download manager application for Windows didn't support chunked HTTP transfer properly. The downloader / web site grabber application, Irvine, is an excellent Japanese software but the development halted a few years ago. Its document said it would support it in future, but it's seemed that it wouldn't arrive for a while. If I have enough time I want to create an HTTP download manager just for myself in C# or something, but it's not possible at that time due to time constraints. The functionality I needed was to download many files off a web site that is configured to send all files in chunked transfer. Then I began evaluating a network library of Perl to see whether it supports chunked transfer. If it does I can just write a short Perl script. But it doesn't end there, as I had to download hundreds of files it's preferable that it downloads multiple files simultaneously just like the aforementioned Windows application. I knew Perl had implemented multithreading years ago, but still had a vague impression that it might be awkward. So this is a nice occasion to make an experiment on these 2 points: HTTP networking and multithreading.
Thinking about them, it came to my mind that I'd read somewhere that Ruby supported them out of the box. Ruby supports multithreading even in DOS by its own non-native threads. Also Ruby's library is supposed to be able to handle networking with ease as far as I know. I'd known how Ruby works and what kind of things are available for it for years but didn't write a Ruby code just because Perl was sufficient for my use. On the other hand the transition from Perl 5 to Perl 6 seemed not exactly smooth and I'd waited years for it to come but eventually the whole Perl scene got out of my interest when I was into other things. So this is a good occasion to evaluate Ruby for myself and see how my prejudice against it can stand.
So let's write a multithreaded download manager for Perl. First, we have to choose how to download files. The goal is to get an application that can handle chunked transfer properly and just that. Before writing a complicated network negotiation with basic Perl Socket classes, it's safe to search what can be done with preexisting tools. libwww-perl (LWP) is a set of Perl modules which provides API to write a web client. Fortunately it could download a file from the web site from which the Windows downloader application failed to download a file. If the standard Perl distribution for Windows available from ActiveState doesn't have LWP, you have to download it from the author's page at the CPAN and build it with the nmake of Visual C++ (probably freely available from Microsoft) in a command console. libwww-perl gives you a web client object to surf the web.
Another thing that needs clarification is how Perl's multithreading works. Perldoc has a tutorial for Perl multithreading, perlthrtut. This document carries most of the info you need to program a Perl multithread application. If there's only one point specific to the Perl thread, it's that memory space is not shared by threads unlike usual thread implementations. Most of primitives employed in multithreaded programming are available in there. For more info about thread-related modules in Perl such as threads::shared, see their documents from the author's page at CPAN.
By the way I first tried to make my downloader application in this way - it spawns multiple worker threads from the main thread and suspends the main thread by putting a blocking semaphore (by setting the counter value) in it, then a worker thread downloads a file into the download queue, and it spawns another thread at the end of a thread. When the queue becomes empty the semaphore in the main thread is lifted, then all threads are joined and the main thread ends. The total number of worker threads is constant throughout execution. But it didn't work as expected, for some unknown reason the Perl interpreter crashes at the end of execution everytime I run it though it can download all the files in the list. If what you are writing is a native multithread application for Microsoft Windows this should work, but in Perl's case you have to join all threads without spawning a new thread from a running thread. I first thought it's caused by a bug in the thread module and browsed the forum for the module. What I learned there was some modules are not thread-safe and can't be used in Perl threads. libwww-perl looks like one of them and if I removed it the program could end without an error. Seeking the solution I downloaded the latest version of the threads module from the author's page and built it. I tested threads-1.42 which was the latest version. It had a small problem when compiled with VC++, though. It couldn't detect the existence of a C compiler even though I used the command console that came with the Microsoft Platform SDK. The solution is to edit Makefile.pl and skip the have_cc subroutine. But it crashed too. Without a choice I abandoned this initial design and decided to use a more traditional, pthread-esque design.
It spawns worker threads and they get in infinite loops that pick up a job from a job queue one by one. The main thread blocks by calling the
join
function on these worker threads. This looks fairly simple even from the description of it and may look better than the one I described in the previous paragraph, but I prefer a more asynchoronous style if it works. I wrote it and this time it worked flawlessly without a crash or resource leaks. It's tested with the Win32 version of Perl 5.8.8 available from ActiveState, threads-1.42 module, and libwww-perl-5.805 module. It begins with a simple user configuration section.
# user configuration begin ################################
# You have to list the URLs for the files to be downloaded in a file named
# "download_files.txt" and put it in the same directory as this script.
my $username = "";
my $password = "";
my $number_of_threads = 3;
my $download_interval_sec = 3;
my $download_list_filename = "download_files.txt";
my $storage_directory = 'E:\program\src\downloader\store'; # set "" for the current directory
my $user_agent_string = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";
# user configuration end ##################################
$username
and $password
can be set for basic authentication at password-protected pages. $number_of_threads
tells how much worker threads are executed to process downloading tasks simultaneously. It's also possible to set how many seconds it waits before making a new download connection in $download_interval
.After user configuration options, necessary modules are loaded along with other tools useful throughout this application.
use strict;
use warnings;
use threads;
use threads::shared;
#use Thread::Queue;
use Thread::Semaphore;
use LWP::UserAgent;
use Cwd;
my $current_directory = Cwd::getcwd();
if (!$storage_directory)
{
$storage_directory = $current_directory;
}
$| = 1;
my @download_queue : shared; # used for optimization instead of Thread::Queue
my $sem_download_queue = new Thread::Semaphore;
my $sem_stdout = new Thread::Semaphore;
my $last_download_time : shared = 0;
The
strict
and warnings
modules are in action here to ban obscure expressions which are often seen in Perl hacks. While this is a relatively short script it's always nice that minor things don't bother users. $| = 1
expression is what you always see in Perl scripts that manage realtime I/O. It prevents buffering and forces Perl to issue more I/O calls. @download_queue
is the synchoronized download queue where all jobs are put and fetched from in the serial order. To make it accessible from multiple threads, it has to be marked explicitly with the shared
keyword. This is the Perl-specific way of thread programming I wrote above. Actually Perl already has a useful tool for a situation like this as the Thread::Queue
module though it's absent in this example to control the scope of synchonization explicitly for a bit better performance. So it requires a semaphore object as a synchronization primitive, $sem_download_queue
. $sem_stdout
is the semaphore to synchronize the standard output, since multiple threads try to print text reports simultaneously. $last_download_time
is the variable to force threads to put an interval between downloads as explained in the configuration options.The next part is the main loop of the script.
print "Perl threads version: " . $threads::VERSION . "\n";
open(IN, $current_directory . "\\" . $download_list_filename) || die("Can't open " . $download_list_filename);
while (<IN>)
{
chomp;
if ($_ =~ /^http:/i)
{
push @download_queue, $_;
}
}
close(IN);
print @download_queue . " URIs have been loaded from the download list\n";
for (my $i = 0; $i < $number_of_threads; ++$i)
{
new threads(\&download_thread_func);
}
foreach my $thr (threads->list)
{
if ($thr->tid)
{
$thr->join;
}
}
print "Download completed\n";
################################################################
It's very short. It's cooler that there are less things to tweak than to be error-prone. It reads the content of the URL list and puts all of them in the download job queue. Then spawns worker threads, and waits them by calling
join
. As the join
function blocks, the main thread can stay in memory waiting all worker threads to return from the work. If it didn't block, the main thread would just exit and the whole program would halt while leaking thread resources.A spawned worker thread calls the function referenced as
download_thread_func
. Let's see the part that does the heavy work.
sub print_ts
{
$sem_stdout->down;
my $tid = threads->self->tid();
print ($tid . ": " . shift @_);
$sem_stdout->up;
}
sub download_thread_func
{
for (;;)
{
my $u = "";
my $sleep_time = 0;
$sem_download_queue->down;
if (@download_queue == 0)
{
$sem_download_queue->up;
print_ts "The download queue is empty\n";
return;
}
else
{
$u = shift @download_queue;
# $last_download_time is protected by $sem_download_queue
my $t = time();
if ($t < $last_download_time)
{
$sleep_time = $last_download_time - $t + $download_interval_sec;
}
else
{
if ($t - $last_download_time < $download_interval_sec)
{
$sleep_time = $download_interval_sec - ($t - $last_download_time);
}
$last_download_time = $t + $sleep_time;
}
}
$sem_download_queue->up;
if ($sleep_time)
{
print_ts "Sleeping for " . $sleep_time ." seconds\n";
sleep($sleep_time);
}
download_uri($u);
}
}
print_ts
is just a debug output function mainly used to show HTTP headers. It's a print functon synchronized by a semaphore. The download_thread_func
subroutine fetches a job from the job queue then consumes it. It continues this behavior until the queue gets empty. The part that tests and modifies the download queue is synchronized by a single semaphore. An actual download task is handled by the download_uri
subroutine described below.
sub download_uri
{
my $uri = shift @_;
print_ts "Downloading $uri\n";
my $ua = new LWP::UserAgent;
$ua->cookie_jar({});
my $req = new HTTP::Request(GET => $uri);
$req->header(
"User-Agent" => $user_agent_string,
"Accept" => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
"Accept-Charset" => 'iso-8859-1,*,utf-8',
"Accept-Language" => 'en-US'
);
if ($username)
{
$req->authorization_basic($username, $password);
}
my $response = $ua->request($req);
if ($response->is_success)
{
print_ts "header begin-----------------------------------------\n"
. $response->headers_as_string
. "header end-------------------------------------------\n";
my $is_chunked = (
($response->header('Transfer-Encoding') && $response->header('Transfer-Encoding') =~ /Chunked/i)
|| ($response->header('Client-Transfer-Encoding') && $response->header('Client-Transfer-Encoding') =~ /Chunked/i)
)
? 1 : 0;
if ($response->header('Content-Length') && $response->header('Content-Length') == 0 && !$is_chunked)
{
print_ts "Content-Length is zero\n";
}
else
{
my $image_name = "tmp.bin";
if ($uri =~ /\/([^\/]+)$/)
{
$image_name = $1;
}
if ($response->header('Content-Disposition'))
{
if ($response->header('Content-Disposition') =~ /filename=(.+)/i)
{
$image_name = $1;
$image_name =~ s/"//g;
}
}
if (!open(OUT, ">$storage_directory\\$image_name"))
{
print_ts "open error : $storage_directory\\$image_name\n";
exit();
}
else
{
binmode(OUT);
#my $fsize = $response->header('Content-Length');
if (!defined(syswrite OUT, $response->content, length($response->content)))
{
print_ts "syswrite error : $storage_directory\\$image_name\n";
}
syswrite OUT, $response->content, length($response->content);
close(OUT);
print_ts "Downloaded $image_name\n";
}
}
}
else
{
my $st = $response->status_line;
print_ts "Error: $uri : $st\n";
}
}
It instantiates an object of the
LWP::UserAgent
class to do HTTP negotiation and sets necessary HTTP headers for it. After a request is sent, the object receives an HTTP response header and body. Since the purpose of this script is to download a file off a server that does chunked transfer, it searches an HTTP header line that indicates it, Transfer-Encoding
or Client-Transfer-Encoding
. In a chunked transfer, a real file name is communicated by a Content-Disposition
header, so it scans this header too. When all necessary info are available the rest thing to do is to save received data into a file. If it's a binary file, you have to call binmode
to set the file handle in the binary mode on Windows. The data length can be obtained by a Content-Length
header for a nomal HTTP response but the chunked transfer is the method to send a file without sending a Content-Length
header. Conveniently the data size is already known by the length of $response->content
data.That's all for the Perl downloader script. It's about 200 lines and it works, no big deal. So let's move onto the next task, porting it to Ruby.
Before writing the code, since Japanese is my 1st language I went to the Japanese side of the Ruby HQ, ruby-lang.org for the Ruby language/library reference. It's known that Ruby was invented in Japan and some non-Japanese people often complain help documents of Ruby are rather weak compared to Perl or Python in a language comparison war. But the reality is, the Japanese documents for Ruby are downright terrible. At least the official manual is very sparse and unorganized. Probably mailing lists and code samples in Japanese may carry beefier contents than English equivalents, but the Japanese documents at ruby-lang.org are not something you yearn for and I actually found English documents linked from the English side are more useful.
Writing a Ruby code immediately after writing a Perl code is a bit puzzling experience. In Ruby, the
@
prefix means an instance variable (or an object property, in a more usual object-oriented lingo) unlike the array expression in Perl. In Ruby it's about a variable scope and in Perl it's about a data type. I like Ruby's scope which is more object-oriented and cleaner than Perl's one which is heavily dependent on its symbol table and often the source of dirty hacks. But the main problem is besides that. It's about braces. As I wrote at the beginning of this article I like braces and would like to use it as much as I can do. But Ruby only allows it at blocks and I often trip on it by using braces where it accepts only do...end
. Another thing is, Ruby doesn't require you to end a line with a semicolon, but when you put an unwanted carriage-return it just emits syntax errors and stops. So you can't hit the Enter key just to tidy up the layout. It is especially problematic when you declare a block parameter for a block which is one of the main selling points of Ruby. So my advice for writing a Ruby code is: don't use braces and don't hit the Enter key too often.Let's take a look at the Ruby version. It has the user configuration part at the beginning and it won't need much explanation as it's almost identical to the Perl version. Some variables are declared as instance variables of the main function to make them available in methods. In the Ruby version, HTTP negotiation is handled by the standard
net/http
library which, fortunately again, can download chunked-transferred files. thread
is Ruby's intrinsic class for multithread programming. uri
is a small utility class to handle a URI. Net::HTTP.version_1_2
is the pragma to instruct net/http
to use the newer implementation. if $0 == __FILE__ line
is an include guard but it has no influence in this demo. This script is tested with Ruby 1.8.5.
# user configuration begin ################################
@username = ""
@password = ""
number_of_threads = 3
@download_interval_sec = 3
download_list_filename = "download_files.txt"
@storage_directory = 'E:\program\src\downloader\store' # set "" for the current directory
@user_agent_string = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
# user configuration end ##################################
require 'net/http'
require 'thread'
require 'uri'
Net::HTTP.version_1_2
if $0 == __FILE__
The following section contains the Ruby version of the functions with the same name as in the Perl version. The only difference is it uses synchronization primitives (mutex) more intuitively.
def puts_ts(x)
@mutex_pt.synchronize do
puts Thread.current.object_id.to_s + ": " + x
end
end
def download_thread_func
while true do
u = ""
sleep_time = 0
@mutex_dq.synchronize do
if @download_queue.empty?
puts_ts "The download queue is empty"
return
else
u = @download_queue.shift
# @last_download_time is protected by @mutex_dq
t = Time.now.gmtime.to_i
if t < @last_download_time
sleep_time = @last_download_time - t + @download_interval_sec
else
if t - @last_download_time < @download_interval_sec
sleep_time = @download_interval_sec - (t - @last_download_time)
end
@last_download_time = t + sleep_time
end
end
end
if sleep_time != 0
puts_ts "Sleeping for " + sleep_time.to_s + " seconds"
sleep sleep_time
end
download_uri u
end
end
The next section is again the function with the same name as the Perl version, but its content is a little different.
def download_uri(uri)
puts_ts "Downloading #{uri}"
req = Net::HTTP::Get.new(URI.parse(uri).path)
req["User-Agent"] = @user_agent_string
req["Accept"] = 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*'
req["Accept-Charset"] = 'iso-8859-1,*,utf-8'
req["Accept-Language"] = 'en-US'
req.basic_auth @username, @password unless @username.empty?
begin
Net::HTTP.start(URI.parse(uri).host, 80) do |http|
http.request(req) do |res|
h = "#{res.code} #{res.message}\nheader begin-----------------------------------------\n"
is_chunked = false
filename = ""
res.canonical_each do |n, v|
h += (n + ": " + v + "\n")
if n =~ /Transfer-Encoding/i && v =~ /Chunked/i
is_chunked = true
end
if n =~ /Content-Disposition/i && v =~ /filename=(.+)/i
filename = $1
filename.gsub(/^"|"$/, '')
end
end
h += "header end-------------------------------------------"
puts_ts h
unless res.code == "200"
return
end
if filename == "" && uri =~ /\/([^\/]+)$/
filename = @storage_directory + "\\" + $1
end
open(filename, "wb") do |file|
res.read_body do |str|
file.write str
end
end
end
end
rescue Exception => e
puts_ts e.to_s
end
end
By processing a code block the Net::HTTP class can process an incoming HTTP response-body stream in multiple small chunks as the
read_body
function of the Net::HTTPResponse
class returns. It means you don't have to store the whole received file in memory before dumping it onto a file unlike the Perl version. When a downloaded file is too large this is necessity.The rest covers the main function. I think you can see the pattern here and it needs no further explanation.
<<
is a useful overloaded operator to push an element to the end of an array just like stack manipulation.
# main begin #############################################################
@mutex_pt = Mutex::new
if @storage_directory.empty?
@storage_directory = Dir.pwd
end
@download_queue = []
@mutex_dq = Mutex::new
@last_download_time = 0;
File.foreach(download_list_filename) {|line|
@download_queue << line.chomp if line.match(/^http:/i)
}
threads = []
for i in 1..number_of_threads
threads << Thread::start do download_thread_func end
end
for i in 1..number_of_threads
threads[i - 1].join
end
# main end ###############################################################
end
That's all for the Ruby version.
Now how do they compare? The first thing to be noticed is the script length. The Perl version is over 200 lines and 4,890 bytes. The Ruby version is under 150 lines and 3,383 bytes. The second point is the performance. Squeezing extra performance is the only reason why they are multithreaded. I used them to download dozens of files and the Perl version took up about 15MB RAM and the Ruby version occupied only 5MB. This is probably due to the difference of downloading methods between 2 network libraries (whole download vs partial), but also due to the implementation of interpreter threads in these languages, since 15MB RAM usage is a bit too high for downloading files around 100KB. The speed seemed a tie, but in some cases the Perl version was faster and the Ruby version stuttered. Probably it can be attributed to file I/O and some context-switching awkwardness in the Ruby version.
My final verdict on this subject is, Ruby is the winner, contrary to some of the possible doubts I suggested above. Of course I still don't like
begin...end
, but if you really like the way Java programs are written I guess you like Ruby too. Ruby is even nicer than Java. Basically there's nothing Ruby can't do except for Perl-specific symbolic hacks. Why not just jump ship and switch to Ruby for good?But this discourse doesn't end there. Ruby is certainly a new kid on the block (well not very new actually but leave it at that now) that presents great opportunities. Even then, there may be something useful in an older and proven tool. My server application, DICE, ended up with an embedded Perl interpreter instead of Ruby. DICE has already had a Common Language Runtime (CLR) embedded into it since years ago, but for the upcoming new version that took over a year in preparation I planned to add something new for another bullet point for the update list. Though at first I wanted to embed PHP which is fairly popular for a web application because of the affinity with the Apache HTTPd, I abandoned that idea for now. The information about PHP embedding seemed scarce on the web. (BTW I downloaded the source code of the PHP5 and browsed it, it's not much larger than the source code of the DICE in size which was a surprise for me.) So the candidates are, as you expect, Ruby and Perl. What I'd like to accomplish is embedding an interpreter in the DICE and making it process a web application, but not executing it for every request ala CGI. However, the merit of including such an interpreter is that it can support existing applications available without writing a new code. The DICE has a CLR embedded in it to execute a web application, but it requires a user to write a stub code, or an entire application. This time I want existing CGI applications to run on the DICE. Therefore the goal is to add mod_perl or mod_ruby like capability to it. The requirements are
1. The interpreter is persistent.
2. All input from the standard input and all output to the standard output must be hooked by the DICE because stdin/stdout are the points where CGI scripts interact with an internet user through a web server.
This section begins with Ruby. Before writing a code for embedding Ruby, the license of Ruby has to be examined. It adopts the dual license by the GPL and the other BSD-like license. It seems you can do whatever with Ruby if you don't choose GPL, but in reality it's not that simple. The current Ruby implementation (1.8) contains the GNU regular expression library and it forces you to make your application GPLed if you just embed a Ruby interpreter as is. Actually there's a workaround and it's very simple. The current development version of Ruby (1.9) has the Oniguruma regular expressions library which is under the BSD license to avoid this license issue. If you choose to stick to 1.8, the source code of the Oniguruma version 2.x is available for download, it can be integrated to the Ruby 1.8 source code (the instruction is included in the Onigruma document). After building the whole Ruby, you can open the binary (msvcr71-ruby19.dll if you use VC++ 7.1 and Ruby 1.9) with Dependency Walker to see if it has Onigruma symbols (prefixed with Onig) and not GNU regex ones.
The first thing you have to do is to download the Ruby source code at the download page of the ruby-lang.org. Then obviously you have to learn how to embed Ruby. The particular document I referred to when I wrote my sample code was magazine articles written by Shugo Maeda, the author of mod_ruby. They are here and here at Mr. Maeda's web site, and his VIM patch is here. They show how he embedded Ruby into the vi-clone editor VIM. Unfortunately they are in Japanese though some code samples are also available with them. And this is the real horror story, I couldn't find other detailed Japanese documents that are helpful in embedding Ruby. Actually the English EmbedRuby article at the Ruby Garden is probably the second best source of info on this matter. After reading the tutorial, you can download the C source code of mod_ruby to learn how a real-world application hosts a Ruby interpreter though it may be hard for those who are not familiar with how an Apache module works.
In addition to the basics of embedding Ruby, Maeda's article offers how to interrupt the standard output of Ruby and trap it to feed to VIM instead of the console screen. Ruby's outputs are all assigned in the
$>
special variable and its native C implementation is the object named rb_defout
. By hooking the write
method of this object you can trap all Ruby outputs to the standard output. But how? It's possible by the rb_define_singleton_method
function defined in the class.c of the Ruby source code. It defines a "singleton method" which is a special method for a special object, which is rb_defout
in this case. Also each output method in Ruby can be redefined by the rb_define_global_function
function. The source code of mod_ruby shows that it actually redefines all related methods in Ruby with respective hook functions. VALUE
is a Ruby type which can hold any Ruby object. It resembles the VARIANT
type in COM, or the void* type in C.
-- class.c --
void
rb_define_singleton_method(VALUE obj, const char *name, VALUE (*func)(ANYARGS), int argc)
{
rb_define_method(rb_singleton_class(obj), name, func, argc);
}
void
rb_define_method(VALUE klass, const char *name, VALUE (*func)(ANYARGS), int argc)
{
rb_add_method(klass, rb_intern(name), NEW_CFUNC(func, argc), NOEX_PUBLIC);
}
void
rb_define_global_function(const char *name, VALUE (*func)(ANYARGS), int argc)
{
rb_define_module_function(rb_mKernel, name, func, argc);
}
-- eval.c --
void
rb_add_method(VALUE klass, ID mid, NODE *node, int noex)
{
}
To build a C++ code with a Ruby interpreter embedded, Ruby header files, a Ruby static library and dynamic link library are required. The header file,
ruby.h
, is found in the root directory of the Ruby source code archive. The static/dynamic libraries can be generated by building the Ruby source code. To build it on Windows, open a VC++/Platform SDK build environment console and run the configure.bat
in the win32
directory then type nmake
. You get msvcr71-ruby19.lib
as a static library and msvcr71-ruby19.dll
as a dll. Note that it doesn't produce a static library for static linking. These are release builds that refer to the release version of the MS C-runtime library. I tried to build a debug version by editing the makefile (changing the compiler option /MD to /MDd) but the build process stopped in building one of .exe. It seemed to have created libraries at least, but I couldn't link it with a compiled C++ code in the Debug mode nor in the Release mode.Once the Ruby library is ready you have to set up a build environment for a Ruby-embedded application. Set the root directory of the Ruby source code in your include path and set the library path in your LIB directory (In VC++ it's in Tool | Options | Projects | VC++ Directories | Library files). As the resulted executable requires
msvcr71-ruby19.dll
to run it should be copied into the working directory. As for the C-runtime library, you have to choose a multi-threaded DLL, since Ruby's memory manager is built with a multithread version of the library. Unfortunately the library built by the default makefile has conflicts with other libraries in the Debug mode with the multi-threaded debug DLL (/MDd). This problem can be addressed by putting libcmtd.lib
in Linker | Input | Ignore Specific Library in the VC++ project options for the Debug mode. In the Release mode it's not required, but it still issues linker warnings LNK4049. I tried to create a project file by myself but failed to do that as the Ruby I built couldn't get the current directory as it can't access some APIs. I hope these issues with Windows are resolved in future.The environment is ready, let's see the actual C++ code. This sample code is written for a Ruby 1.9 development snapshot and doesn't work with Ruby 1.8 as there are significant changes in the Ruby source code in 1.9. What this code does is fairly simple, initializes a Ruby interpreter, loads and executes a Ruby script, then destroys the interpreter. It includes necessary files at the beginning. For the sake of simpleness I included the library file by the pragma, but it may be safe to do it in the project settings if your project contains many files that may conflict with it. (In the case of Perl embedding, it didn't work.) In addition to the
ruby.h
, standard C++ headers are included for utility functions.
#pragma comment(lib, "E:\\program\\lib\\rubylib\\ruby_win32\\usr\\lib\\msvcr71-ruby19-static.lib")
extern "C" {
#include <ruby.h>
}
#include <iostream>
#include <string>
using namespace std;
First, output hook functions are defined.
VALUE ruby_write_hook(VALUE self, VALUE str)
{
str = rb_obj_as_string(str);
cout << string(RSTRING_PTR(str), RSTRING_LEN(str));
return Qnil;
}
VALUE ruby_p_hook(int argc, VALUE *argv, VALUE self)
{
VALUE str = rb_str_new("", 0);
for (int i = 0; i < argc; i++)
{
if (i > 0)
rb_str_cat(str, ", ", 2);
rb_str_concat(str, rb_inspect(argv[i]));
}
cout << RSTRING_PTR(str);
return Qnil;
}
ruby_write_hook
is a function to hook the singleton method of rb_defout
as explained already. It extracts a char*
pointer and its length from a Ruby string that is assigned to this special object, and outputs it by itself with cout
. Qnil
is the null value in the Ruby C implementation. ruby_p_hook
is defined to redefine the kernel method p
which can print the state of an object in a human readable form like the ToString
method in C#/Java. It creates a new Ruby string by rb_str_new
and enumerates its elements by adding them with rb_str_concat
.init_ruby
is the function that initializes the Ruby interpreter. ruby_init_loadpath
initializes the library path of Ruby. show_error_pos
and show_exception_info
are the functions that show detailed error information just like what you get when you feed an erroneous Ruby script to the Ruby interpreter.
void init_ruby()
{
ruby_init();
ruby_init_loadpath();
}
void show_error_pos()
{
ID this_func = rb_frame_this_func();
if (ruby_sourcefile)
{
if (this_func)
{
cout << ruby_sourcefile << ":" << ruby_sourceline << ":in" << rb_id2name(this_func) << endl;
}
else
{
cout << ruby_sourcefile << ":" << ruby_sourceline << endl;
}
}
}
void show_exception_info()
{
if (NIL_P(ruby_errinfo))
return;
VALUE errat = rb_funcall(ruby_errinfo, rb_intern("backtrace"), 0);
if (!NIL_P(errat))
{
VALUE mesg = (RARRAY_PTR(errat))[0];
if (NIL_P(mesg))
{
show_error_pos();
}
else
{
cout << string(RSTRING_PTR(mesg), RSTRING_LEN(mesg));
}
}
VALUE eclass = CLASS_OF(ruby_errinfo);
char* einfo;
int elen;
int state;
VALUE estr = rb_protect(rb_obj_as_string, ruby_errinfo, &state);
if (state)
{
einfo = "";
elen = 0;
}
else
{
einfo = RSTRING_PTR(estr);
elen = RSTRING_LEN(estr);
}
if (eclass == rb_eRuntimeError && elen == 0)
{
cout << ": unhandled exception" << endl;
}
else
{
VALUE epath;
epath = rb_class_path(eclass);
if (elen == 0)
{
cout << ": " << string(RSTRING_PTR(epath), RSTRING_LEN(epath)) << endl;
}
else
{
char* tail = 0;
int len = elen;
if ((RSTRING_PTR(epath))[0] == '#')
epath = 0;
if (tail = strchr(einfo, '\n'))
{
len = tail - einfo;
tail++;
}
cout << ": " << string(einfo, len);
if (epath)
{
cout << " (" << string(RSTRING_PTR(epath), RSTRING_LEN(epath)) << endl;
}
if (tail)
{
cout << string(tail, elen - len - 1) << endl;
}
}
}
if (!NIL_P(errat))
{
const int TRACE_HEAD = 8;
const int TRACE_TAIL = 5;
const int TRACE_MAX = TRACE_HEAD + TRACE_TAIL + 5;
RArray* ep = RARRAY(errat);
long len = RARRAY_LEN(errat);
for (int i = 1; i < len; ++i)
{
if (TYPE((RARRAY_PTR(errat))[i]) == T_STRING)
{
cout << " from " << string(RSTRING_PTR((RARRAY_PTR(errat))[i]), RSTRING_LEN((RARRAY_PTR(errat))[i])) << endl;
}
if (i == TRACE_HEAD && len > TRACE_MAX)
{
cout << " ... " << len - TRACE_HEAD - TRACE_TAIL << "ld levels..." << endl;
i = len - TRACE_TAIL;
}
}
}
}
The last part is the function that loads and executes a Ruby script (
execute_ruby
) and the main function. It registers output hook functions in the Ruby interpreter by rb_define_singleton_method
and rb_define_global_function
.
void execute_ruby(const char* pScriptName)
{
int state = 0;
extern VALUE rb_defout;
typedef VALUE (*rubyfunc)(...);
rb_defout = rb_obj_alloc(rb_cObject);
rb_define_singleton_method(rb_defout, "write", (rubyfunc)ruby_write_hook, 1);
rb_define_global_function("p", (rubyfunc)ruby_p_hook, -1);
// Always need a full path
rb_load_protect(rb_str_new2(pScriptName), 0, &state);
if (state)
{
switch (state)
{
case 0x1: // TAG_RETURN
cout << "unexpected return" << endl;
show_error_pos();
break;
case 0x2: // TAG_BREAK
cout << "unexpected break" << endl;
show_error_pos();
break;
case 0x3: // TAG_NEXT
cout << "unexpected next" << endl;
show_error_pos();
break;
case 0x4: // TAG_RETRY
cout << "retry outside of rescue clause" << endl;
show_error_pos();
break;
case 0x5: // TAG_REDO
cout << "unexpected redo" << endl;
show_error_pos();
break;
case 0x6: // TAG_RAISE
case 0x8: // TAG_FATAL
show_exception_info();
break;
default:
cout << "unknown longjmp status " << state << endl;
break;
}
}
rb_gc();
}
int _tmain(int argc, _TCHAR* argv[])
{
init_ruby();
execute_ruby("test.rb");
ruby_finalize();
return 0;
}
The
rb_load_protect
function is the protected version of the rb_load
function that can suppress exceptions. Unless there's a special reason you should always use the protected version of Ruby functions. After a script is evaluated, it shows error information if any. rb_gc
is called at the end to invoke the garbage collector.When this program is executed it loads the "test.rb" Ruby script. My "test.rb" was a simple script that does print some strings and apparently it could output characters through the redefined hook functions. I thought that my experiment was a success. But I did one more test. The new test fed the Ruby downloader script which I described above to this test code. To my surprise, it halted with the "memory error" in the Ruby interpreter. I had no idea what's wrong, but apparently the culprit was the
net/http
library. The library path is in the $LOAD_PATH
variable, which means it's not a loading error. I tried to debug it but the Ruby library was compiled without debug symbols as explained at the beginning of this section. I was really disappointed as I thought the experiment was almost a success. Eventually I lost my interest in embedding Ruby into my application. It is known that the Ruby author himself admits that the weakness of the current Ruby implementation is in embedding because it can't have multiple interpreters side by side, but I didn't know there was such a basic issue was left. It's expected the next version of the Ruby VM addresses some of the issues in embedding, but apparently it's far off in 2008. There may be some workaround if I look more carefully into the mod_ruby source code, but I'm out of energy for further research for the time being. If you know a solution please email me, thanks in advance.The last hope is naturally on Perl. First you need the Perl source code. I used a Perl development snapshot available at here but snapshots of the development trunk in this directory often fail to build or pass the test. After looking at the latest version number of the stable release at Perl.com , you may pick up the latest stable snapshot in another directory. You can find exactly which snapshot is broken at the perl.perl5.porters newsgroup.
To build it on Windows, edit the
Makefile
in the win32 direcrtory (in most cases nothing more is needed than just editing the install path and the compiler selection). Then open a VC++/Platform SDK build environment console and type nmake
to start the building process. After it's done type nmake test
to test it and type nmake install
to build and install all Perl libraries and the Perl interpreter into the install path you specified in the Makefile
.General information on how to embed a Perl interpreter into your C/C++ application is available in the perlembed document. However, I recommend you to browse perlguts and perlcall before going to perlembed. Also perlapi may be useful for a reference. perlguts explains how the C implementation of Perl represents Perl's data types and subroutines in C. perlcall explains how to call Perl subroutines from C. If you still have questions after reading these documents, you can always grep the Perl source code for functions, function macros and comments.
For an actual embedding code sample, mod_perl is probably the most complete. If you read
mod_perl.c
in the src\modules\perl directory of mod_perl 2, you'll find the modperl_response_handler_cgi
function as the main function that executes a Perl CGI script. The basic flow of this function is1. Setup environmental variables passed to Perl
2. Override Perl's standard input and standard output
3. Process an HTTP request to a Perl CGI
4. Restore everything back
Basically it's in line with what I wrote I wanted to do earlier in this article. Apache does most of these things by directly manipulating Perl by C code. But I don't need as much security as the Apache module does, since the current version of the DICE doesn't support multi-user. Running multiple interpreters is not required for now since the context switching method by
PERL_SET_CONTEXT
in the current Perl implementation sucks. It's based on heavy use of macro and requires nasty macro hacks which will most likely have unwanted effects in unexpected areas in a large project. The Perl source code predetermines the name of an interpreter object as "my_perl" and predefined macros can't take other names, which means you have to assign your Perl interpreter pointer to a local variable PerlInterpreter* my_perl
everytime you have to do something with it. Also the perl header file redefines many symbols for its own API, which means if your C++ class has a member function named write
it conflicts with Perl. So my advice is, don't include perl.h
in a header file that is included by many other files in your project. Include it at the beginning of a .cpp source file and put all Perl-related things in it. But how can you define a class that has a pointer to a Perl interpreter as its property? You can put a char*
or something equivalent in the place of the pointer to an interpreter in your class declaration and cast it to PerlInterpreter*
everytime you use it in the .cpp source file.I decided to offload most of the sandbox work to Perl to keep the C++ part simple. The basic design is like this:
1. The Perl interpreter in the DICE loads a setup Perl script which sets
STDIN
/STDOUT
hooks and offers other necessary utilities for sandboxing2. Whenever a CGI script is called, if it's the first time, the Perl interpreter evaluates it as a subroutine in a unique package by using
eval
in Perl, then it's cached as a compiled code. If it is an already compiled code in the cache, it just returns a reference to it.3. The Perl interpreter executes a compiled Perl code by dereferencing its subroutine reference and the C++ part gets the output by reading a Perl variable to collect all Perl output. The C++ part is protected and synchronized by a critical section that keeps this Perl variable alive (though there's certain performance penalty)
Let's take a look at the Perl setup script (
EmbeddedPerlSandbox.pl
). This script was written for the DICE but a newer version may be included in the latest package. To see how it works you should download the DICE and run its web server by yourself. First, it defines the package EmbeddedPerlSandboxOut
that intercepts and overrides the standard out (STDOUT
in Perl). The required Perl version is specified as 5.8.0, but a lower version may work too.
# EmbeddedPerlSandbox for DICE
# (c) RyuK 2006 All Rights Reserved
#
# klassphere[at.mark]gmail.com
# http://aiueo.da.ru/
# http://zzz.zggg.com/
#
# This file is not redistributable.
#
use 5.8.0;
package EmbeddedPerlSandboxOut;
sub TIEHANDLE
{
my $classname = shift;
my $buffer = "";
bless \$buffer, $classname;
}
sub PRINT
{
my $buffer = shift;
my $s = shift;
$$buffer .= $s;
}
sub PRINTF
{
my $buffer = shift;
my $s = shift;
$$buffer .= sprintf($s, @_);
}
sub WRITE
{
my ($buffer, $data, $len, $offset) = @_;
if (!defined($len))
{
$len = length($data);
}
if (!defined($offset))
{
$offset = 0;
}
$$buffer .= substr($data, $offset, $len);
}
sub READLINE
{
my $buffer = shift;
return $$buffer;
}
sub BINMODE
{
# does nothing
return 1;
}
sub CLOSE
{
my $buffer = shift;
undef $buffer;
}
sub DESTROY
{
my $buffer = shift;
undef $buffer;
}
When a Perl script is executed, this class is associated with
STDOUT
by the tie
Perl function which calls the TIEHANDLE
subroutine. It creates an object which is just a scalar value that works as a buffer to hold the content of the standard output. After that, if print
tries to write something onto STDOUT
the PRINT
function in this class is called. READLINE
is called by <>
.The next section defines
EmbeddedPerlSandboxIn
to override the standard input in the same way.
package EmbeddedPerlSandboxIn;
sub TIEHANDLE
{
my $classname = shift;
my $buffer = shift;
bless \$buffer, $classname;
}
sub READLINE
{
my $buffer = shift;
# substr EXPR,OFFSET,LENGTH,REPLACEMENT
return (length($$buffer) ? substr(
$$buffer,
0,
(defined($/) ? index($$buffer, "$/"): length($$buffer) - 1) + 1,
""
)
: undef);
}
sub READ
{
my ($buffer, $len, $offset) = ($_[0], $_[2], $_[3]);
if (!defined($offset))
{
$offset = 0;
}
if ($len > length($$buffer))
{
$len = length($$buffer);
}
# You can use the substr() function as an lvalue
substr($_[1], $offset, $len) = substr($$buffer, 0, $len, "");
return $len;
}
sub GETC
{
my $buffer = shift;
return (length($$buffer) ? substr(
$$buffer,
0,
1,
""
)
: undef);
}
sub BINMODE
{
# does nothing
return 1;
}
sub CLOSE
{
my $buffer = shift;
undef $buffer;
}
sub DESTROY
{
my $buffer = shift;
undef $buffer;
}
The
TIEHANDLE
function receives and holds a scalar value as the content of the standard input when it's associated with STDIN
. For READ
, the second parameter ($_[1]
) is a reference (not a Perl "reference", but in the broader computer science terminology) to the receiving variable, since Perl subroutines are call-by-reference.From here on the main class EmbeddedPerlSandbox is defined.
package EmbeddedPerlSandbox;
$SANDBOX_OUTPUT = "";
%CODE_STORE = {};
BEGIN
{
push @INC, Win32::GetCwd() . "\\perl_lib";
*CORE::GLOBAL::exit = \&EmbeddedPerlSandbox::exit;
*CORE::GLOBAL::flock = \&EmbeddedPerlSandbox::flock;
}
$sandbox_output
is the global package variable that holds the standard output. The C++ part retrieves the output of a Perl user script by referencing this variable later. %CODE_STORE
is a cache that holds a compiled Perl script with its unique package name as a key. The BEGIN
block gets evaluated when the C++ code loads this setup script. @INC
is the array of library paths. The "perl_lib" directory under the current directory is added to it. To obtain the current directory it uses Win32::GetCwd
which is a function in a builtin class for the Windows version of Perl without loading the Cwd
module. Also it overrides the exit
and the flock
builtin functions by assigning subroutine references to their type globs. The exit
function ends the Perl interpreter by calling exit()
in C, which means not only the Perl interpreter but also the entire application just ends its execution if it's called. Such a situation must be avoided for an obvious reason. The flock
function should work in theory even in the Windows version, but when I tried it it stopped execution there due to an unknown reason. As I explained already the C++ section is synchronized by a critical section, so the Perl section is synchronized too. I decided to trap flock
and substitute it with an NOP function.The next subroutine is necessary to make a user script behave like a Perl CGI. If a persistent interpreter just executes the same compiled code twice, all global variables persist in the second execution. For example,
if (!defined($var))
{
$var = 1;
}
print "$var\n";
$var++;
if this code is executed repeatedly in a persistent interpreter the displayed number continues to increase. This behavior is OK for a Java servlet and other persistent web applications, but not desirable if a Perl CGI is the target of emulation. It can be avoided by clearing global package variables previously defined in the package before executing a user code. This
cleanupSymbolTable
subroutine does the job by scanning the symbol table hash (aliased as "stash" in the Perl jargon) of the user code package. In the Perl language, the reflection system is readily available in the form of the symbol table and it's almost too exposed if you ask me. Though I don't like it because it can be a source of nasty Perl hacks, the reflection itself is certainly a necessary feature when creating a plugin system or a self-contained world of DSL.
sub cleanupSymbolTable
{
my $id = shift;
while (my ($name, $glob_entry) = each %{$id . "::"})
{
local *v = $glob_entry;
if (defined($v))
{
undef $v;
}
if (defined(@v))
{
undef @v;
}
if (defined(%v))
{
undef %v;
}
}
}
$id
is the name of the user code package. It enumerates all symbols in that package and receives a typeglob for its symbol (See the "Symbol Tables" in perlmod). If there is a variable (scalar/array/hash) defined within that typeglob it's undefined. Note that it tests all value types as there may be a typeglob that has variables of all the 3 types defined.The next subroutine compiles a user code as the name suggests.
sub compile
{
my ($id, $content, $current_dir) = @_;
if (!length($content))
{
return $CODE_STORE{$id};
}
local $SIG{__WARN__} = \&warn;
# the part undefining global variables must come before $content,
# otherwise $content can't contain 'use strict'
my $sandbox =<< "SBOX";
package $id;
sub __wrapper
{
EmbeddedPerlSandbox::cleanupSymbolTable("$id");
\@ARGV = \@_;
$content ;
}
SBOX
my @tmp = @INC;
push @INC, $current_dir;
eval $sandbox;
@INC = @tmp;
if ($@)
{
my $e = $@;
if ($e !~ /via package/ && $e =~ /line (\d+)/)
{
my $n = $1;
my $m = $n - 6;
$e =~ s/line $n/line $m/g;
}
$e =~ s/\(eval \d+\) //g;
$SANDBOX_OUTPUT = "Content-Type: text/plain\n\nPerl Compilation Error: " . $e . "\n";
}
return $CODE_STORE{$id} = *{$id . "::__wrapper"}{CODE}; # CODE reference - see perlsub
}
$id
is the unique package name for this script. It has to be a unique name to avoid a conflict with other scripts compiled later in this interpreter instance. $content
is the actual Perl CGI script content. If it's empty it just returns an already compiled code for this unique package name. In the actual implementation in the DICE the C++ part checks the file time of a script to see if it has to be reloaded or not, but it can be implemented in Perl too if it's preferable for some reason. Though it may not be necessary, the __WARN__
signal is trapped with the warn
subroutine to suppress a potential interrupt. The $sandbox
here-document is the code that wraps a user code into a unique package that has only a wrapper subroutine code in it. A user code is embedded in this wrapper function and placed after the symbol table cleaner function explained above. In this "here document" you have to add "\
" to a variable that should not be evaluated at the time of the definition of $sandbox
. $current_dir
has the current directory for this script. Since the use
function in Perl is evaluated at the compile time you have to add the current directory for this script to @INC
and restore it after the evaluation of a user code by eval
. If a script has an error, $@
has the error information. When evaluation is done, the reference to a compiled code can be obtained by getting the "CODE" entry in the subroutine's typeglob.These subroutines are just utility functions.
clearCodeStore
is a function that clears the code store as the name suggests. When you run a persistent interpreter you can call this function to reclaim unused resources. addRuntimeError
is an error reporting function in executing a user code.
sub clearCodeStore
{
%CODE_STORE = ();
}
sub addRuntimeError
{
my $e = shift;
if (length($SANDBOX_OUTPUT))
{
if ($SANDBOX_OUTPUT !~ /Perl Runtime Error/)
{
$SANDBOX_OUTPUT .= "Perl Runtime Error: " . $e . "\n";
}
}
else
{
$SANDBOX_OUTPUT = "Content-Type: text/plain\n\nPerl Runtime Error: " . $e . "\n";
}
}
The
execute
subroutine executes a compiled user code.
sub execute
{
my ($id, $compiled, $stdin, $env, $current_dir, $clear_codestore) = @_;
local $SIG{__WARN__} = \&warn;
if (!chdir($current_dir))
{
addRuntimeError("chdir: " . $current_dir);
return;
}
my $oh = tie(*STDOUT, "EmbeddedPerlSandboxOut");
my $ih = tie(*STDIN, "EmbeddedPerlSandboxIn", $stdin);
my %env_vars = ();
foreach my $kv (split(/\n/, $env))
{
if ($kv =~ /([^=]+)=(.+)/)
{
$env_vars{$1} = $2;
}
}
%main::ENV = %env_vars;
my @tmp = @INC;
push @INC, $current_dir;
eval {$compiled->();}; # you need this semicolon...
@INC = @tmp;
if ($@)
{
unless ($@ =~ /EmbeddedPerlSandbox::exit/)
{
my $e = $@;
if ($e !~ /via package/ && $e =~ /line (\d+)/)
{
my $n = $1;
my $m = $n - 6;
$e =~ s/line $n/line $m/g;
}
$e =~ s/\(eval \d+\) //g;
addRuntimeError($e);
}
}
undef $ih;
untie *STDIN;
if (!$@)
{
$SANDBOX_OUTPUT = <STDOUT>;
}
undef $oh;
untie *STDOUT;
if ($clear_codestore)
{
clearCodeStore();
}
}
First it changes the current directory to the directory of the user script by calling
chdir
. I tried to move the current directory in the C++ side by the SetCurrentDirectory
Windows API and the PerlDir_chdir
Perl C API, but for some reason the Perl side was not affected by them. Then it overrides the standard input and the standard output by using tie
and the EmbeddedPerlSandboxIn
/EmbeddedPerlSandboxOut
classes as explained already. Unfortunately it seems not to be able to override the implicit STDIN
expression in <>
. $stdin
passes the content of the standard input which is the input by a web user in the case of a web server. $env
has environmental variables for CGI. As it's passed in a scalar value delimited by '\n' and '=' it's parsed and stored in the %main::ENV
hash to which a CGI code refers. $compiled
is the code reference and it's invoked by $compiled->()
subroutine expression in the eval
block. After it's done the content of the standard output is saved in $SANDBOX_OUTPUT
then the standard input and the standard output are restored.The last section is for the hook functions for these dangerous global functions.
sub exit
{
$SANDBOX_OUTPUT = <STDOUT>;
die("EmbeddedPerlSandbox::exit");
}
sub flock
{
# does nothing
return 1;
}
sub warn
{
die("Perl Warning: " . $_[0] . "\n");
}
1;
The
exit
function saves the content of the standard output in $SANDBOX_OUTPUT
and stops execution by the die
function. As the die
function can be trapped by eval
, the exit
function can be emulated this way. flock
does nothing and warn
is almost the same with exit
. A Perl module has to return 1 at the end.That's all for the Perl side (
EmbeddedPerlSandbox.pl
). Let's check out the C++ sample code that drives a Perl interpreter. To build it you have to add the Perl source code directory to the include directories of your project and then add the perl59.lib
static library to the linker input. The resultant executable requires perl59.dll
to run.
#include <string>
#include <cstdio>
using namespace std;
#include <EXTERN.h>
#include <perl.h>
EXTERN_C void xs_init (pTHX);
EXTERN_C void boot_DynaLoader (pTHX_ CV* cv);
EXTERN_C void xs_init(pTHX)
{
char *file = __FILE__;
dXSUB_SYS;
/* DynaLoader is a special case */
newXS("DynaLoader::boot_DynaLoader", boot_DynaLoader, file);
}
The
xs_init
glue code is picked up from the skeleton produced by the ExtUtils::Embed
tool (See "Using Perl modules, which themselves use C libraries, from your C program" in perlembed).
int _tmain(int argc, _TCHAR* argv[])
{
int argc_perl = 0;
char* embedding[] = {"", "E:\\program\\src\\perlembedtest2\\EmbeddedPerlSandbox.pl"};
PERL_SYS_INIT(&argc_perl, (char***)embedding);
PerlInterpreter* my_perl = perl_alloc();
perl_construct(my_perl);
PL_exit_flags |= PERL_EXIT_DESTRUCT_END;
perl_parse(my_perl, xs_init, 2, embedding, NULL);
perl_run(my_perl);
The
embedding
char pointer array is an argument list passed to the Perl interpreter. In the second parameter it sets the path of the EmbeddedPerlSandbox.pl
. Then it creates and runs a Perl interpreter.The next part calls the
EmbeddedPerlSandbox::compile
subroutine in the Perl side to compile a Perl script. The unique package name for this script is in id
and the script content is in content
for the sake of this sample code.
const char* id = "test";
string content(
"my $data = ''; read(STDIN, $data, 2); print \"$data\n\"; print getc(STDIN) . \"\n\";"
"while (<STDIN>) {print ($i++ . ': ' . $_);}"
"open FH, '>log.txt';print FH 'test'; close(FH);"
);
dSP;
ENTER;
SAVETMPS;
PUSHMARK(SP);
XPUSHs(sv_2mortal(newSVpv(id, 0)));
// don't use 0 as the second argument since Perl uses strlen for 0
XPUSHs(sv_2mortal(newSVpv(content.c_str(), content.size())));
PUTBACK;
call_pv("EmbeddedPerlSandbox::compile", G_EVAL);
SPAGAIN;
SV* sandbox = 0;
if (SvTRUE(ERRSV))
{
POPs; // see perlcall for G_EVAL
printf("ERRSV\n");
printf(SvPVX(ERRSV));
return 0;
}
else
{
sandbox = newSVsv(POPs);
}
PUTBACK;
FREETMPS;
LEAVE;
It pushs two strings
id
and content
in a locally copied Perl stack. The newSVpv
function creates a new Perl scalar value for a string (pv: pointer value). If you set 0 for the second parameter it uses C strlen()
to calculate the length of the pointed string by a char*
pointer. Since a Perl script can contain a null character (though unlikely) this sample gives the length of the script explicitly by using C++ basic_string
for content
. When the parameters are ready, it puts the stack back to Perl by the PUTBACK
macro and calls the subroutine by call_pv
. See the details for Perl stack manipulation in perlcall. If call_pv
is successful it pops the return value from the stack and assigns it to a new Perl scalar value sandbox
that holds a compiled Perl code reference.In this section it actually execute the compiled Perl code.
in
is a C++ basic_string
object that has an emulated standard input. env
is supposed to have environmental variables, but this sample code omits it. new_dir
is the current directory for this Perl code. As the sample Perl script fiddles with a file, it is created in this directory. Though the EmbeddedPerlSandbox::execute
in Perl takes 6 parameters, the 6th parameter is omitted again for this sample. It calls EmbeddedPerlSandbox::execute
with 5 arguments with no return value expected (G_VOID
).
string in("abcHello world\nI think\nTherefore I am\n");
string env;
string new_dir("E:\\program\\src\\perlembedtest2\\Debug");
ENTER;
SAVETMPS;
PUSHMARK(SP);
XPUSHs(sv_2mortal(newSVpv(id, 0)));
XPUSHs(sandbox);
XPUSHs(sv_2mortal(newSVpv(in.c_str(), in.size())));
XPUSHs(sv_2mortal(newSVpv(env.c_str(), env.size())));
XPUSHs(sv_2mortal(newSVpv(new_dir.c_str(), new_dir.size())));
PUTBACK;
call_pv("EmbeddedPerlSandbox::execute", G_VOID);
SPAGAIN;
STRLEN n_a;
const char* output = SvPV(get_sv("EmbeddedPerlSandbox::SANDBOX_OUTPUT", FALSE), n_a);
int len = (int)n_a;
printf("returned: %d bytes\n", len);
// it can contain a null character
for (int i = 0; i < len; ++i)
{
if (i == 0)
printf("[%c", output[i]);
else
printf("%c", output[i]);
}
if (len != 0)
printf("]\n");
PUTBACK;
FREETMPS;
LEAVE;
After
EmbeddedPerlSandbox::execute
is executed, it retrieves the Perl scalar value of EmbeddedPerlSandbox::SANDBOX_OUTPUT
by calling get_sv
and converting it into a char*
by SvPV
. output
is the string that holds the standard output of the Perl script.
PL_perl_destruct_level = 0;
perl_destruct(my_perl);
perl_free(my_perl);
PERL_SYS_TERM();
return 0;
}
When the work is done it frees the Perl interpreter. It's the end of the C++ sample code. I implemented it in the DICE with additional care for a persistent interpreter.
The winner of this round is Perl. But it's not free from a problem, for example its header file redefines common API names with C macros. It's not designed to host multiple interpreters and plagued with excessive C macro use. I hope this mess is cleaned up in Perl 6 with a new VM (Parrot), just like Ruby's new VM (YARV) plans. If Perl 6 becomes the mainstream this article has to be rewritten since the embedding method should be fairly different from the current one, though I somehow doubt the likelihood that Perl 6 becomes too popular. One thing for sure is I'm satisfied with the relative robustness of Perl, for now.
コメント
コメントを投稿