Softpanorama


Pythonizer: Translator from Perl to Python

Pythonizer -- a research project in "fuzzy" translation 

by Dr. Nikolai Bezroukov


Abstract

Pythonizer translates Perl scripts into Python 3.8 and is aimed mainly at sysadmin scripts. It reorganizes the Perl code by moving subroutines to the top, and it partially compensates for differences in variable visibility by inserting into each Python subroutine definition a global statement that lists the affected variables.

The result usually has no syntax errors, but the semantics of some complex constructs do not match the original, and 10 to 20% of statements require manual editing.

It can also be used to understand existing Perl code by people who know Python and have been assigned to maintain legacy Perl scripts.

Introduction

Some organizations are now converting their Perl codebase into Python. But a more common task is maintaining existing Perl scripts. Often this task is assigned to programmers who know Python but not Perl (that includes most university graduates). In this case, a program that "explains" Perl constructs in Python terms would be extremely useful and, sometimes, a lifesaver. Of course, Perl 5 is here to stay (please note what happened to the people who predicted the demise of Fortran ;-), and in most cases, old scripts will stay too.

The other role is to give a quick start to system administrators who know only Perl but want to learn Python (for example, because they need to support researchers who work in it). This is an interesting approach to teaching a new language to those who know Perl, and know it well.

The idea here is that such a tool can be created with relatively modest effort along the lines of the Floyd-Evans language, with explicit operations on a stack (with or without reductions) or, better, on a series of parallel stacks. A tool written with some knowledge of compiler technologies falls into the category of "small language compilers", with a total effort of around one man-year or less. Assuming ten lines of debugged code per day for a task comparable in complexity to writing a compiler, the estimated size should be around 3-5K lines of code (~1K lines for phase 1 and 2-4K lines for phase 2).

Another idea is to use heuristic approaches, recognizing common idioms. We can claim that this approach belongs to the now fashionable "machine learning", which is an umbrella term for all kinds of fancy stuff, including heuristics. I would call this approach "fuzzy compilation", a kind of "imperfect", heuristic peephole optimization. It looks like we can make an "educated guess" for many common cases by analyzing just a small fragment of Perl code, without understanding the larger context. Of course, in some cases the guess will be wrong, but this is the price to pay.
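
The idiom-driven approach can be illustrated with a toy sketch in Python. This is not Pythonizer's actual code: the rule table RULES and the function name transliterate are hypothetical, and each rule looks at a single statement with no knowledge of the surrounding context.

```python
import re

# Hypothetical rule table: (Perl pattern, Python replacement template).
# Every rule is an "educated guess" based on one statement only.
RULES = [
    (re.compile(r'^\$(\w+)\+\+;$'), r'\1+=1'),                     # $lineno++;
    (re.compile(r'^chomp\(\$(\w+)\);$'), r'\1=\1.rstrip("\\n")'),  # chomp($line);
    (re.compile(r'^\$(\w+)=(\d+);$'), r'\1=\2'),                   # $tab=4;
]

def transliterate(stmt):
    """Return a guessed Python equivalent, or None if no idiom matches."""
    stmt = stmt.strip()
    for pattern, template in RULES:
        if pattern.match(stmt):
            return pattern.sub(template, stmt)
    return None  # unknown construct: leave it for manual editing
```

A statement that matches no rule is simply reported as untranslated, which is the "price to pay" mentioned above.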

As of August 2020, around 3.5K lines of code were written, and the alpha version works more or less OK on simple Perl scripts (sysadmin scripts), with a statement translation success rate of around 80-90%. Success depends on the style in which the script is written and favors the classic procedural style over the OO style. The initial version was designed for Python 2.7. Reworking it for Python 3.8+, starting with version 0.3, increased the success rate and accuracy considerably, if only because most Perl while loops can now be translated directly and no longer require factoring the statement out of the while loop header and replicating it at the bottom.
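
The point about while loops can be shown with a small sketch. It assumes the common Perl idiom of reading a line in the loop header, such as while( $line=<STDIN> ){ ... }; the function names are hypothetical and the code is an illustration, not Pythonizer output.

```python
import io

def read_old_style(stream):
    """Pre-3.8 shape: the read is factored out of the header and repeated at the bottom."""
    lines = []
    line = stream.readline()
    while line:
        lines.append(line.rstrip("\n"))
        line = stream.readline()   # replicated statement
    return lines

def read_walrus(stream):
    """Python 3.8+ shape: the assignment stays in the loop header, as in Perl."""
    lines = []
    while (line := stream.readline()):
        lines.append(line.rstrip("\n"))
    return lines
```

Both functions return the same list for the same input; the second one keeps a one-to-one correspondence with the Perl statement, which is what makes direct translation possible.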

Since version 0.7, pythonizer performs a first pass that creates the table of variables so that global variables can be detected. In the future that table might also be used for some conversions. Currently we assume that the Perl script does not abuse implicit conversions from string to floating-point number. In sysadmin scripts Perl programmers mostly avoid this functionality.
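
Why the variable table matters can be seen from a hand-written illustration of the kind of output such a pass makes possible. The names debug, lineno, and bump_line are hypothetical; the sketch only assumes that Perl package variables become module-level names in Python.

```python
# Hypothetical generated output: Perl package variables become module-level names.
debug = 0
lineno = 0

def bump_line():
    global debug, lineno   # list that would come from the first-pass variable table
    lineno += 1            # without the global statement: UnboundLocalError
    if debug > 0:
        print(f"line {lineno}")
    return lineno
```

In Python an assignment inside a function creates a local name by default, so a statement like lineno += 1 fails unless the name is declared global. Perl has no such rule, which is why the list has to be generated.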

The main test is an early version of pre_pythonizer.pl renamed to maintest.pl. pre_pythonizer.pl is the (optional) script that converts a Perl script into a form more suitable for processing by refactoring subroutines and pushing them up, which is required for Python.
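
The reason subroutines must be pushed up can be demonstrated with a short sketch. Perl lets the main program call a sub that is defined further down the file; Python looks names up at call time, so the def has to be executed first. The helper names here are hypothetical, and exec is used only to keep the demonstration self-contained.

```python
def call_before_definition():
    """Run a tiny script in which main() is called before its def is executed."""
    script = "main()\ndef main():\n    return 0\n"
    try:
        exec(script, {})
    except NameError as err:
        return f"NameError: {err}"
    return "ok"

def call_after_definition():
    """Same script with the definition moved above the call."""
    script = "def main():\n    return 0\nresult = main()\n"
    namespace = {}
    exec(script, namespace)
    return namespace["result"]
```

The first variant fails with a NameError, the second runs normally, so the reordering is a prerequisite rather than a cosmetic step.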

The run of pythonizer on this script serves as a kind of acceptance test, which allows us to detect when the codebase reaches the alpha stage. Of course, this script uses a very small subset of Perl (so-called sysadmin Perl; no OO, no modules). Still, in a way it is representative of a large category of Perl scripts written by system administrators, who rarely use esoteric Perl features (using Perl mostly as a "better Bash"). Coming from a C+shell background, most of them, especially the older folk, prefer the procedural style of programming. This category of programmers writes only GUI programs, and sometimes Web tools, in OO style.

For this category of scripts, automatic translation (or, more correctly, transliteration ;-) can provide some labor savings during conversion, allowing the task to be accomplished in less time and with higher quality.

Essentially you can start to debug and enhance the converted script on the same day. Of course, when Perl functions and regexes are used extensively, many statements will be incorrect.

And Perl has enough idiosyncrasies to prevent even heuristic transliteration, to say nothing of translation. One interesting lesson from writing the alpha version is that while Perl superficially has a decently designed lexical level (at least in comparison with Bash ;-), the devil is in the details, and a lexical scanner for Perl is a very complex undertaking. It took probably half of the time spent on the project and slowed it considerably.

Also, the idea that Python is a simple, orthogonal language proved to be false. The system library in Python is a mess. In some areas, it is more convoluted than Perl's and has more ways to accomplish the same task.

Please note that the current version is an alpha version, not a beta (traditionally, beta versions are 0.9 - 0.999). So it can crash on some scripts or go into an infinite loop. In this case the offending statement can be removed and the program resubmitted. That usually helps.

Due to the size of the code and the fact that this is a hobby project, I currently do not plan major enhancements such as adding automatic conversions. But if time permits, major changes and enhancements are, of course, possible.

Another problem is that when Python has a built-in function similar to a Perl one, it typically does not match the Perl function 1:1, so you need to take the translation with a grain of salt; that is probably unavoidable. For example, I worked on the translation of substr for probably two dozen hours, and I am still unsure that I did it right.
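
The substr case shows where the mismatch comes from. A naive translation of substr($s,$offset,$length) into the slice s[offset:offset+length] breaks for negative offsets: "abcdef"[-1:0] is an empty string, while Perl's substr("abcdef",-1,1) returns "f". The helper below is a minimal sketch of the in-range semantics, not the code Pythonizer generates; the name perl_substr is hypothetical, and out-of-range arguments (where Perl returns undef with a warning) are deliberately not modeled.

```python
def perl_substr(s, offset, length=None):
    """Approximate Perl's substr(EXPR, OFFSET, LENGTH) for in-range arguments.

    Assumption: out-of-range offsets are not modeled; Python slicing
    silently yields '' where Perl would return undef with a warning.
    """
    n = len(s)
    # A negative OFFSET counts from the end of the string.
    start = offset if offset >= 0 else max(n + offset, 0)
    if length is None:
        return s[start:]
    if length >= 0:
        return s[start:start + length]
    # A negative LENGTH leaves that many characters off the end.
    end = n + length
    return s[start:end] if end > start else ''
```

When the arguments are literals, a translator can often emit a plain slice instead of a helper call; the hard part is the case where the sign of the offset or length is known only at run time.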

Also, the absence of goto and of the "until" loop, as well as differences in operator precedence, complicated the task further. But simple statements can be translated more or less OK (see below).
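
Two of these differences can be sketched in a few lines. The first function shows the usual rewrite of a Perl until loop as while not; the second shows one precedence trap: Python's not binds more loosely than ==, while Perl's ! binds more tightly. The function names are hypothetical and the code is an illustration only.

```python
def countdown_until_zero(n):
    """Perl: until( $n == 0 ){ push @seen,$n; $n--; }"""
    seen = []
    while not (n == 0):    # until(COND) becomes while not (COND)
        seen.append(n)
        n -= 1
    return seen

def precedence_demo(a, b):
    """Compare the Python reading of 'not a == b' with the Perl reading of '!$a == $b'."""
    python_reading = not a == b         # parsed as not (a == b)
    perl_like_reading = (not a) == b    # what Perl's  !$a == $b  means
    return python_reading, perl_like_reading
```

For a=0 and b=2 the two readings disagree, so a character-by-character transliteration of such an expression silently changes its meaning unless parentheses are added.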

Some missing features can be emulated: right now, double-quoted literals are decompiled and then translated into a sequence of concatenations (see below). In Python 3.8+, they can be compiled into f-strings. Assignment in an if statement is now available in Python 3.8 via the walrus operator, so you no longer need to refactor the code and push assignments out of conditionals.

In other words, switching to Python 3.8 considerably simplified Pythonizer and improved the quality of the translation.
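
Both simplifications can be seen side by side in a short sketch. The Perl statements in the docstrings follow the style of pre_pythonizer.pl, but the Python function names are hypothetical and the code is a hand-written illustration, not tool output.

```python
def log_dir_concat(script_name):
    """Older shape: the literal "/tmp/$SCRIPT_NAME" is split into concatenated parts."""
    return '/tmp/' + str(script_name)

def log_dir_fstring(script_name):
    """f-string shape: mirrors the Perl double-quoted literal directly."""
    return f"/tmp/{script_name}"

def strip_extension(name):
    """Perl: if( ($dotpos=index($name,'.'))>-1 ){ $name=substr($name,0,$dotpos); }"""
    if (dotpos := name.find('.')) > -1:   # assignment stays inside the condition
        name = name[0:dotpos]
    return name
```

In both cases the Python statement keeps the structure of the Perl one, so no statement has to be moved or duplicated during translation.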

Example of translation

The test below shows the current capabilities of pythonizer. It was run on the source of pre_pythonizer.pl -- the script that was already posted on GitHub (see pre_pythonizer.pl).

The full protocol is also available: Full protocol of translation of pre_pythonizer.pl by the current version of pythonizer. Here is a relevant fragment that gives some idea of how automatic transliteration is performed:

  52 | 0 |      |SCRIPT_NAME=__file__[__file__.rfind('/')+1:]                                     #PL: $SCRIPT_NAME=substr($0,rindex($0,'/')+1);
  53 | 0 |      |if (dotpos:=SCRIPT_NAME.find('.'))>-1:                                           #PL: if( ($dotpos=index($SCRIPT_NAME,'.'))>-1 ) {
  54 | 1 |      |   SCRIPT_NAME=SCRIPT_NAME[0:dotpos]                                             #PL: $SCRIPT_NAME=substr($SCRIPT_NAME,0,$dotpos);
  56 | 0 |      |
  57 | 0 |      |OS=os.name # $^O is built-in Perl variable that contains OS name
                                                                                                  #PL:    $OS=$^O;
  58 | 0 |      |if OS=='cygwin':                                                                 #PL: if($OS eq 'cygwin' ){
  59 | 1 |      |   HOME='/cygdrive/f/_Scripts'    # $HOME/Archive is used for backups
                                                                                                  #PL:       $HOME="/cygdrive/f/_Scripts";
  60 | 0 |      |elif OS=='linux':                                                                #PL: elsif($OS eq 'linux' ){
  61 | 1 |      |   HOME=os.environ['HOME']    # $HOME/Archive is used for backups
                                                                                                  #PL:       $HOME=$ENV{'HOME'};
  63 | 0 |      |LOG_DIR=f"/tmp/{SCRIPT_NAME}"                                                    #PL: $LOG_DIR="/tmp/$SCRIPT_NAME";
  64 | 0 |      |FormattedMain=('sub main\n','{\n')                                               #PL: @FormattedMain=("sub main\n","{\n");
  65 | 0 |      |FormattedSource=FormattedSub.copy                                                #PL: @FormattedSource=@FormattedSub=@FormattedData=();
  66 | 0 |      |mainlineno=len(FormattedMain) # we need to reserve one line for sub main
                                                                                                  #PL:    $mainlineno=scalar( @FormattedMain);
  67 | 0 |      |sourcelineno=sublineno=datalineno=0                                              #PL: $sourcelineno=$sublineno=$datalineno=0;
  68 | 0 |      |
  69 | 0 |      |tab=4                                                                            #PL: $tab=4;
  70 | 0 |      |nest_corrections=0                                                               #PL: $nest_corrections=0;
  71 | 0 |      |keyword={'if': 1,'while': 1,'unless': 1,'until': 1,'for': 1,'foreach': 1,'given': 1,'when': 1,'default': 1}
                                                                                                  #PL: %keyword=('if'=>1,'while'=>1,'unless'=>1, 'until'=>1,'for'=>1,'foreach'=>1,'give
                                                                                                  Cont:  n'=>1,'when'=>1,'default'=>1);
  72 | 0 |      |
  73 | 0 |      |logme(['D',1,2]) # E and S to console, everything to the log.
                                                                                                  #PL:    logme('D',1,2);
  74 | 0 |      |banner([LOG_DIR,SCRIPT_NAME,'PREPYTHONIZER: Phase 1 of pythonizer',30]) # Opens SYSLOG and print STDERRs banner; parameter 4 is log retention period
                                                                                                  #PL:    banner($LOG_DIR,$SCRIPT_NAME,'PREPYTHONIZER: Phase 1 of pythonizer',30);
  75 | 0 |      |get_params() # At this point debug  flag can be reset
                                                                                                  #PL:    get_params();
  76 | 0 |      |if debug>0:                                                                      #PL: if( $debug>0 ){
  77 | 1 |      |   logme(['D',2,2])    # Max verbosity
                                                                                                  #PL:       logme('D',2,2);
  78 | 1 |      |   print(f"ATTENTION!!! {SCRIPT_NAME} is working in debugging mode {debug} with autocommit of source to {HOME}/Archive\n",file=sys.stderr,end="")
                                                                                                  #PL: print STDERR "ATTENTION!!! $SCRIPT_NAME is working in debugging mode $debug with
                                                                                                  Cont:   autocommit of source to $HOME/Archive\n";
  79 | 1 |      |   autocommit([f"{HOME}/Archive",use_git_repo])    # commit source archive directory (which can be controlled by GIT)
                                                                                                  #PL:       autocommit("$HOME/Archive",$use_git_repo);
  81 | 0 |      |print(f"Log is written to {LOG_DIR}, The original file will be saved as {fname}.original unless this file already exists ",)
                                                                                                  #PL: say "Log is written to $LOG_DIR, The original file will be saved as $fname.origi
                                                                                                  Cont:  nal unless this file already exists ";
  82 | 0 |      |print('=' * 80,'\n',file=sys.stderr)                                             #PL: say STDERR  "=" x 80,"\n";
  83 | 0 |      |
  84 | 0 |      |#
  85 | 0 |      |# Main loop initialization variables
  86 | 0 |      |#
  87 | 0 |      |new_nest=cur_nest=0                                                              #PL: $new_nest=$cur_nest=0;
  88 | 0 |      |#$top=0; $stack[$top]='';
  89 | 0 |      |lineno=noformat=SubsNo=0                                                         #PL: $lineno=$noformat=$SubsNo=0;
  90 | 0 |      |here_delim='\n' # impossible combination
                                                                                                  #PL:    $here_delim="\n";
  91 | 0 |      |InfoTags=''                                                                      #PL: $InfoTags='';
  92 | 0 |      |SourceText=sys.stdin.readlines().copy                                            #PL: @SourceText=;
  93 | 0 |      |
  94 | 0 |      |#
  95 | 0 |      |# Slurp the initial comment block and use statements
  96 | 0 |      |#
  97 | 0 |      |ChannelNo=lineno=0                                                               #PL: $ChannelNo=$lineno=0;
  98 | 0 |      |while True:                                                                      #PL: while(1){
  99 | 1 |      |   if lineno==breakpoint:                                                        #PL: if( $lineno == $breakpoint ){
 101 | 2 |      |      pdb.set_trace()                                                            #PL: }
 102 | 1 |      |   line=line.rstrip("\n")                                                        #PL: chomp($line=$SourceText[$lineno]);
 103 | 1 |      |   if re.match(r'^\s*$',line):                                                   #PL: if( $line=~/^\s*$/ ){
 104 | 2 |      |      process_line(['\n',-1000])                                                 #PL: process_line("\n",-1000);
 105 | 2 |      |      lineno+=1                                                                  #PL: $lineno++;
 106 | 2 |      |      continue                                                                   #PL: next;
 108 | 1 |      |   intact_line=line                                                              #PL: $intact_line=$line;
 109 | 1 |      |   if intact_line[0]=='#':                                                       #PL: if( substr($intact_line,0,1) eq '#' ){
 110 | 2 |      |      process_line([line,-1000])                                                 #PL: process_line($line,-1000);
 111 | 2 |      |      lineno+=1                                                                  #PL: $lineno++;
 112 | 2 |      |      continue                                                                   #PL: next;
 114 | 1 |      |   line=normalize_line(line)                                                     #PL: $line=normalize_line($line);
 115 | 1 |      |   line=line.rstrip("\n")                                                        #PL: chomp($line);
 116 | 1 |      |   (line)=line.split(' '),1                                                      #PL: ($line)=split(' ',$line,1);
 117 | 1 |      |   if re.match(r'^use\s+',line):                                                 #PL: if($line=~/^use\s+/){
 118 | 2 |      |      process_line([line,-1000])                                                 #PL: process_line($line,-1000);
 119 | 1 |      |   else:                                                                         #PL: else{
 120 | 2 |      |      break                                                                      #PL: last;
 122 | 1 |      |   lineno+=1                                                                     #PL: $lineno++;
 123 | 0 |      |#while
 124 | 0 |      |#
 125 | 0 |      |# MAIN LOOP
 126 | 0 |      |#
 127 | 0 |      |ChannelNo=1                                                                      #PL: $ChannelNo=1;
 128 | 0 |      |for lineno in range(lineno,len(SourceText)):                                     #PL: for( ; $lineno<@SourceText; $lineno++  ){
 129 | 1 |      |   line=SourceText[lineno]                                                       #PL: $line=$SourceText[$lineno];
 130 | 1 |      |   offset=0                                                                      #PL: $offset=0;
 131 | 1 |      |   line=line.rstrip("\n")                                                        #PL: chomp($line);
 132 | 1 |      |   intact_line=line                                                              #PL: $intact_line=$line;
 133 | 1 |      |   if lineno==breakpoint:                                                        #PL: if( $lineno == $breakpoint ){
 135 | 2 |      |      pdb.set_trace()                                                            #PL: }
 136 | 1 |      |   line=normalize_line(line)                                                     #PL: $line=normalize_line($line);
 137 | 1 |      |
 138 | 1 |      |   #
 139 | 1 |      |   # Check for HERE line
 140 | 1 |      |   #
 141 | 1 |      |
 142 | 1 |      |   if noformat:                                                                  #PL: if($noformat){
 143 | 2 |      |      if line==here_delim:                                                       #PL: if( $line eq $here_delim ){
 144 | 3 |      |         noformat=0                                                              #PL: $noformat=0;
 145 | 3 |      |         InfoTags=''                                                             #PL: $InfoTags='';
 147 | 2 |      |      process_line([line,-1000])                                                 #PL: process_line($line,-1000);
 148 | 2 |      |      continue                                                                   #PL: next;
 150 | 1 |      |
 151 | 1 |      |   if default_match:=re.match("""<<['"](\w+)['"]$""",line):                      #PL: if( $line =~/<<['"](\w+)['"]$/ ){
 152 | 2 |      |      here_delim=default_match.group(1)                                          #PL: $here_delim=$1;
 153 | 2 |      |      noformat=1                                                                 #PL: $noformat=1;
 154 | 2 |      |      InfoTags='HERE'                                                            #PL: $InfoTags='HERE';
 156 | 1 |      |   #
 157 | 1 |      |   # check for comment lines
 158 | 1 |      |   #
 159 | 1 |      |   if line[0]=='#':                                                              #PL: if( substr($line,0,1) eq '#' ){
 160 | 2 |      |      if line=='#%OFF':                                                          #PL: if( $line eq '#%OFF' ){
 161 | 3 |      |         noformat=1                                                              #PL: $noformat=1;
 162 | 3 |      |         here_delim='#%ON'                                                       #PL: $here_delim='#%ON';
 163 | 3 |      |         InfoTags='OFF'                                                          #PL: $InfoTags='OFF';
 164 | 2 |      |      elif re.match(r'^#%ON',line):                                              #PL: elsif( $line =~ /^#%ON/ ){
 165 | 3 |      |         noformat=0                                                              #PL: $noformat=0;
 166 | 2 |      |      elif line[0:6]=='#%NEST':                                                  #PL: elsif( substr($line,0,6) eq '#%NEST') {
 167 | 3 |      |         if default_match:=re.match(r'^#%NEST=(\d+)',line):                      #PL: if( $line =~ /^#%NEST=(\d+)/) {
 168 | 4 |      |            if cur_nest!=default_match.group(1):                                 #PL: if( $cur_nest != $1 ) {
 169 | 5 |      |               cur_nest=new_nest=default_match.group(1)                # correct current nesting level
                                                                                                  #PL:                   $cur_nest=$new_nest=$1;
 170 | 5 |      |               InfoTags=f"={cur_nest}"                                           #PL: $InfoTags="=$cur_nest";
 171 | 4 |      |            else:                                                                #PL: else{
 172 | 5 |      |               InfoTags=f"OK {cur_nest}"                                         #PL: $InfoTags="OK $cur_nest";
 174 | 3 |      |         elif re.match(r'^#%NEST++',line):                                       #PL: elsif( $line =~ /^#%NEST++/) {
 175 | 4 |      |            cur_nest=new_nest=default_match.group(1)+1             # correct current nesting level
                                                                                                  #PL:                $cur_nest=$new_nest=$1+1;
 176 | 4 |      |            InfoTags='+1'                                                        #PL: $InfoTags='+1';
 177 | 3 |      |         elif re.match(r'^#%NEST--',line):                                       #PL: elsif( $line =~ /^#%NEST--/) {
 178 | 4 |      |            cur_nest=new_nest=default_match.group(1)+1             # correct current nesting level
                                                                                                  #PL:                $cur_nest=$new_nest=$1+1;
 179 | 4 |      |            InfoTags='-1'                                                        #PL: $InfoTags='-1';
 180 | 3 |      |         elif re.match(r'^#%ZERO\?',line):                                       #PL: elsif( $line =~ /^#%ZERO\?/) {
 181 | 4 |      |            if cur_nest==0:                                                      #PL: if( $cur_nest == 0 ) {
 182 | 5 |      |               InfoTags=f"OK {cur_nest}"                                         #PL: $InfoTags="OK $cur_nest";
 183 | 4 |      |            else:                                                                #PL: else{
 184 | 5 |      |               InfoTags='??'                                                     #PL: $InfoTags="??";
 185 | 5 |      |               logme(['E',f"Nest is {cur_nest} instead of zero. Reset to zero"]) #PL: logme('E',"Nest is $cur_nest instead of zero. Reset to zero");
 186 | 5 |      |               cur_nest=new_nest=0                                               #PL: $cur_nest=$new_nest=0;
 187 | 5 |      |               nest_corrections+=1                                               #PL: $nest_corrections++;
 191 | 2 |      |      process_line([line,-1000])                                                 #PL: process_line($line,-1000);
 192 | 2 |      |      continue                                                                   #PL: next;
 194 | 1 |      |   if default_match:=re.match(r'^sub\s+(\w+)',line):                             #PL: if( $line =~ /^sub\s+(\w+)/ ){
 195 | 2 |      |      SubList[default_match.group(1)]=lineno                                     #PL: $SubList{$1}=$lineno;
 196 | 2 |      |      SubsNo+=1                                                                  #PL: $SubsNo++;
 197 | 2 |      |      ChannelNo=2                                                                #PL: $ChannelNo=2;
 198 | 2 |      |      CommentBlock=0                                                             #PL: $CommentBlock=0;
 199 | 2 |      |      for backno in range(len(FormattedMain)-1,0,-1):                            #PL: for( $backno=$#FormattedMain;$backno>0;$backno-- ){
 200 | 3 |      |         comment=FormattedMain[backno]                                           #PL: $comment=$FormattedMain[$backno];
 201 | 3 |      |         if re.match(r'^\s*#',comment) or re.match(r'^\s*$',comment): #PL: if ($comment =~ /^\s*#/ || $comment =~ /^\s*$/){
 202 | 4 |      |            CommentBlock+=1                                                      #PL: $CommentBlock++;
 203 | 3 |      |         else:                                                                   #PL: else{
 204 | 4 |      |            break                                                                #PL: last;
 207 | 2 |      |      backno+=1                                                                  #PL: $backno++;
 208 | 2 |      |      for backno in range(backno,len(FormattedMain)):                            #PL: for (; $backno<@FormattedMain; $backno++){
 209 | 3 |      |         comment=FormattedMain[backno]                                           #PL: $comment=$FormattedMain[$backno];
 210 | 3 |      |         process_line([comment,-1000])          #copy comment block from @FormattedMain were it got by mistake
                                                                                                  #PL:             process_line($comment,-1000);
 212 | 2 |      |      for backno in range(0,CommentBlock):                                       #PL: for ($backno=0; $backno<$CommentBlock; $backno++){
 213 | 3 |      |         FormattedMain.pop()          # then got to it by mistake
                                                                                                  #PL:             pop(@FormattedMain);
 215 | 2 |      |      if cur_nest!=0:                                                            #PL: if( $cur_nest != 0 ) {
 216 | 3 |      |         logme(['E',f"Non zero nesting encounted for subroutine definition {default_match.group(1)}"]) #PL: logme('E',"Non zero nesting encounted for subroutine definition $1");
 217 | 3 |      |         if cur_nest>0:                                                          #PL: if ($cur_nest>0) {
 218 | 4 |      |            InfoTags='} ?'                                                       #PL: $InfoTags='} ?';
 219 | 3 |      |         else:                                                                   #PL: else{
 220 | 4 |      |            InfoTags='{ ?'                                                       #PL: $InfoTags='{ ?';
 222 | 3 |      |         nest_corrections+=1                                                     #PL: $nest_corrections++;
 224 | 2 |      |      cur_nest=new_nest=0                                                        #PL: $cur_nest=$new_nest=0;
 225 | 1 |      |   elif line=='__END__' or line=='__DATA__':                                     #PL: elsif( $line eq '__END__' || $line eq '__DATA__' ) {
 226 | 2 |      |      ChannelNo=3                                                                #PL: $ChannelNo=3;
 227 | 2 |      |      logme(['E',f"Non zero nesting encounted for {line}"])                      #PL: logme('E',"Non zero nesting encounted for $line");
 228 | 2 |      |      if cur_nest>0:                                                             #PL: if ($cur_nest>0) {
 229 | 3 |      |         InfoTags='} ?'                                                          #PL: $InfoTags='} ?';
 230 | 2 |      |      else:                                                                      #PL: else{
 231 | 3 |      |         InfoTags='{ ?'                                                          #PL: $InfoTags='{ ?';
 233 | 2 |      |      noformat=1                                                                 #PL: $noformat=1;
 234 | 2 |      |      here_delim='"'       # No valid here delimiter in this case !
                                                                                                  #PL:          $here_delim='"';
 235 | 2 |      |      InfoTags='DATA'                                                            #PL: $InfoTags='DATA';
 237 | 1 |      |   if line[0]=='=' and line!='=cut':                                             #PL: if( substr($line,0,1) eq '=' && $line ne '=cut' ){
 238 | 2 |      |      noformat=1                                                                 #PL: $noformat=1;
 239 | 2 |      |      InfoTags='POD'                                                             #PL: $InfoTags='POD';
 241 | 2 |      |      here_delim='=cut'                                                          #PL: }
 242 | 1 |      |
 243 | 1 |      |   # blank lines should not be processed
 244 | 1 |      |   if re.match(r'^\s*$',line):                                                   #PL: if( $line =~/^\s*$/ ){
 245 | 2 |      |      process_line(['',-1000])                                                   #PL: process_line('',-1000);
 246 | 2 |      |      continue                                                                   #PL: next;
 248 | 1 |      |   # trim leading blanks
 249 | 1 |      |   if default_match:=re.match(r'^\s*(\S.*$)',line):                              #PL: if( $line=~/^\s*(\S.*$)/){
 250 | 2 |      |      line=default_match.group(1)                                                #PL: $line=$1;
 252 | 1 |      |   # comments on the level of nesting 0 should be shifted according to nesting
 253 | 1 |      |   if line[0]=='#':                                                              #PL: if( substr($line,0,1) eq '#' ){
 254 | 2 |      |      process_line([line,0])                                                     #PL: process_line($line,0);
 255 | 2 |      |      continue                                                                   #PL: next;
 257 | 1 |      |
 258 | 1 |      |   # comments on the level of nesting 0 should start with the first position
 259 | 1 |      |   first_sym=line[0]                                                             #PL: $first_sym=substr($line,0,1);
 260 | 1 |      |   last_sym=line[-1]                                                             #PL: $last_sym=substr($line,-1,1);
 261 | 1 |      |   if first_sym=='{' and len(line)==1:                                           #PL: if( $first_sym eq '{' && length($line)==1 ){
 262 | 2 |      |      process_line(['{',0])                                                      #PL: process_line('{',0);
 263 | 2 |      |      cur_nest=new_nest+=1                                                       #PL: $cur_nest=$new_nest+=1;
 264 | 2 |      |      continue                                                                   #PL: next;
 265 | 1 |      |   elif first_sym=='}':                                                          #PL: elsif( $first_sym eq '}' ){
 266 | 2 |      |      cur_nest=new_nest-=1                                                       #PL: $cur_nest=$new_nest-=1;
 267 | 2 |      |      process_line(['}',0])       # shift "{" left, aligning with the keyword
                                                                                                  #PL:           process_line('}',0);
 268 | 2 |      |      if line[0]=='}':                                                           #PL: if( substr($line,0,1) eq '}' ){
 269 | 3 |      |         line=line[1:]                                                           #PL: $line=substr($line,1);
 271 | 2 |      |      while line[0]==' ':                                                        #PL: while( substr($line,0,1) eq ' ' ){
 272 | 3 |      |         line=line[1:]                                                           #PL: $line=substr($line,1);
 274 | 2 |      |      # Case of }else{
 275 | 2 |      |      if not last_sym=='{':                                                      #PL: unless( $last_sym eq '{') {
 276 | 3 |      |         process_line([line,0])                                                  #PL: process_line($line,0);
 277 | 3 |      |         continue                                                                #PL: next;
 279 | 2 |      |      if cur_nest==0:                                                            #PL: if( $cur_nest==0 ){
 280 | 3 |      |         ChannelNo=1          # write to main
                                                                                                  #PL:             $ChannelNo=1;
 283 | 1 |      |   # Step 2: check the last symbol for "{" Note: comments are prohibited on such lines
 284 | 1 |      |   if last_sym=='{' and len(line)>1:                                             #PL: if( $last_sym eq '{' && length($line)>1 ){
 285 | 2 |      |      process_line([line[0:-1],0])                                               #PL: process_line(substr($line,0,-1),0);
 286 | 2 |      |      process_line(['{',0])                                                      #PL: process_line('{',0);
 287 | 2 |      |      cur_nest=new_nest+=1                                                       #PL: $cur_nest=$new_nest+=1;
 288 | 2 |      |      continue                                                                   #PL: next;

 

While only a fragment is shown, the program was able to transliterate most of the statements and run to the end. Type conversions were not performed and need to be added manually. Generally the result needs to be verified line by line, and the code requires editing and optimization.

As you can see, some parts of the translation look suboptimal in Python. For example, the way Perl programmers strip leading and trailing blanks from lines using regexes is not necessary in Python, as it has special built-in functions for that:

117 | 1 |      |   if re.match(r'^use\s+',line):           #PL: if($line=~/^use\s+/){
118 | 2 |      |      process_line([line,-1000])           #PL: process_line($line,-1000);

Similarly

112 | 1 |      |   if line[-1:1]== "\r":                    #PL: if( substr($line,-1,1) eq "\r" ){
113 | 2 |      |      line=line[0:-1]                       #PL: chop($line);

can be replaced by a call to the rstrip method: line=line.rstrip("\r")
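A minimal sketch of the built-in string methods that can stand in for these Perl idioms. The sample value of line is made up for illustration, and the Perl statements in the comments are typical idioms rather than lines taken from the protocol above.

```python
# Sample input (made up): leading blanks, trailing blanks and a carriage return.
line = "   use strict;  \r"

# Perl: chop($line) if substr($line,-1,1) eq "\r";
# Note: rstrip("\r") removes every trailing "\r", not just one.
no_cr = line.rstrip("\r")

# Perl: $line =~ s/^\s+//;          (trim leading whitespace)
left_trimmed = no_cr.lstrip()

# Perl: $line =~ s/^\s+|\s+$//g;    (trim both ends; "\r" counts as whitespace)
trimmed = line.strip()

print(repr(no_cr))
print(repr(left_trimmed))
print(repr(trimmed))
```

Since rstrip with an argument strips all trailing occurrences of the given characters, it is slightly more aggressive than a single chop; for line endings this difference is usually harmless.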

But that's just "suboptimal" stuff. There are also obviously incorrect translations, which is to be expected. For example

(line)=tr(' ',$line,1)

will be translated incorrectly, because this is a Perl idiom that removes leading blanks.

At the same time, this does not mean that improving on the level achieved in the current version of Pythonizer is easy: in programming, the last 20% of functionality usually takes eight times as much effort to implement, debug and test as the first 80% (a kind of another version of the Pareto law).

Based on these very limited testing results, I hope that the structure of Pythonizer that I have chosen (two-phase translation, see below) is OK for the limited purpose I have in mind: to simplify the initial phase of the conversion and provide text with which you can work in a debugger to iron out really significant differences. It can also serve as an educational tool and reference for those who know Perl well, as the amount of know-how incorporated in those 4K lines of code is significant and some translations it performs are non-trivial (for example, for tr/://d).

And, of course, the highest possible quality of translation within the limitations of this approach can't be achieved in a hobby project like this.

But there is a law of diminishing returns here, and I need to know where to stop. Previously I thought that 4K lines of code was the magical limit after which the project needed to be stopped. But I have already reached this limit and can still make significant improvements in the quality of translation relatively easily (it took me a month to get the program from version 0.2 to version 0.5), so the real limit is probably somewhat higher. People who have written significant medium-sized software (say, over 10K source lines) know the feeling when entropy picks up and the code becomes difficult to modify. This is the point at which a one-man project generally needs to stop.

In any case, the idea of a "fuzzy pythonizer" as an experimental approach proved to be valuable, and it is suitable for the subset of Perl typically used in sysadmin scripts. Even simple heuristics proved to work really well. In the current version it can help a sysadmin to convert simple scripts, or to learn Python by doing so, which actually is my main interest.

The Structure of Pythonizer

So far, conceptually, the translator works in two phases/passes. The first phase is the "normalization" of the Perl program into something like a "one statement per line" format, plus refactoring of the order of subroutines (pushing them up).

A problem arises with comparisons. In Perl they are typed, and the variables are coerced into the chosen type (string vs. numeric comparison); in Python they are not typed, and you need to ensure compatibility and, if necessary, convert the operands yourself. If we can't guess the types of the operands then, for example

if ( $line eq $text )

should probably be translated into

if str(line) == str(text) ...
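A minimal sketch of why the explicit conversion matters. The helper names perl_eq and perl_num_eq are hypothetical and are not part of Pythonizer; they only illustrate the coercion that Perl's eq and == perform implicitly.

```python
def perl_eq(left, right):
    """Emulate Perl's string comparison `eq`: compare both operands as strings."""
    return str(left) == str(right)


def perl_num_eq(left, right):
    """Emulate Perl's numeric `==` for operands that look like numbers.

    Simplification: Perl treats a non-numeric string as 0 (with a warning),
    while float() raises ValueError here.
    """
    return float(left) == float(right)


line, text = 10, "10"
print(line == text)               # plain Python: int vs str, no coercion
print(perl_eq(line, text))        # operands coerced to strings first
print(perl_num_eq("10.0", 10))    # operands coerced to numbers first
```

When the types of both operands are known to match, the conversion is redundant and a plain == is enough; the wrappers are only needed where the type guess fails.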

Pythonizer marks untranslatable statements, or parts of statements, with appropriate comments. The translation is performed in two phases:

  1. Pre-pythonizer performs specialized prettyprinting of the code. This phase converts the Perl program into a form that simplifies subsequent parsing and produces the XREF table with "educated guesses" as to the types of variables in the Perl program. It also reorganizes the Perl script into "subroutines first" mode.

    It also needs to mark multiline statements with a #::\ comment. See Neatperl for some ideas used in this phase. An alpha version of this code is posted as pre_pythonizer.pl; see https://github.com/softpano/pythonizer

    1. Most Perl programmers do not change the type of a variable within the program: string variables always contain string values, while numeric variables contain numeric values. So essentially they are using the Python model.
    2. Known exceptions are $_ and a couple of other variables connected with matching, like $1-$9.
  2. Limited lookahead "fuzzy" parsing. As Perl grammar is way too complex, a mixture of recursive descent and some ideas from the long-forgotten Floyd-Evans language is used (see the Gries book, https://www.amazon.com/Compiler-Construction-Digital-Computers-David/dp/047132776X).

    The lexical phase works not one token at a time but one statement at a time, and some peephole optimization is performed within each single statement if necessary. Examples are refactoring of tail conditionals and conversion of the Perl emulation of if statements whose block consists of a single statement, like ($debug) && function, into regular if statements.

    Pythonizer also widely uses the concept of peephole optimization within each statement. In many special cases it replaces Perl constructs with different Python constructs, as the Python library of standard functions is richer than Perl's.
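The statement-level rewriting of tail conditionals and of the (cond) && statement idiom can be sketched as follows. This is a naive illustration, not the code Pythonizer uses: the function normalize_statement and both regular expressions are hypothetical, and they assume that the statement has already been normalized to a single line.

```python
import re

# Trailing modifier:  <statement> if (<cond>);   or   <statement> unless (<cond>);
TAIL_COND = re.compile(
    r'^(?P<stmt>.+?)\s+(?P<kw>if|unless)\s*\((?P<cond>.*)\)\s*;$')
# Short-circuit emulation of "if":  (<cond>) && <statement>;
AND_IDIOM = re.compile(r'^\((?P<cond>[^()]*)\)\s*&&\s*(?P<stmt>.+?)\s*;$')


def normalize_statement(stmt):
    """Rewrite one Perl statement into a regular block form when it matches."""
    stmt = stmt.strip()
    m = TAIL_COND.match(stmt)
    if m:
        return f"{m['kw']}( {m['cond'].strip()} ){{ {m['stmt']}; }}"
    m = AND_IDIOM.match(stmt)
    if m:
        return f"if( {m['cond'].strip()} ){{ {m['stmt']}; }}"
    return stmt  # anything else is passed through unchanged


print(normalize_statement('next if( $a>=$b);'))
print(normalize_statement('($debug) && logme("x");'))
```

A regex-only approach like this is fooled by the word "if" inside string literals; working one whole statement at a time, with the lexer aware of literals, avoids that class of errors.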

Note on preprocessing phase

The preprocessing phase should simplify the subsequent phase and collect some useful statistics on variables that allow guessing their types. In most Perl scripts a variable never changes its initial type: a variable that was initially assigned a string value is always assigned string values, and a variable initially assigned a numeric value is always assigned numeric values for the duration of the program execution.
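One way to picture such a type-guessing heuristic is the sketch below. It is an assumption-laden illustration, not the algorithm used by pre_pythonizer.pl: the function guess_types, its regular expressions and the sample statements are all hypothetical. It records the type suggested by the first simple assignment to each scalar and then treats that type as fixed.

```python
import re

# Simple scalar assignment: $name = <value>;  (rejects "==" and "=~")
ASSIGN = re.compile(r'^\s*\$(\w+)\s*=(?![=~])\s*(.+?)\s*;')
NUMBER = re.compile(r'^[+-]?(\d+\.?\d*|\.\d+)$')


def guess_types(perl_lines):
    """Map each scalar name to the type suggested by its first assignment."""
    types = {}
    for line in perl_lines:
        m = ASSIGN.match(line)
        if not m:
            continue
        name, value = m.groups()
        if name in types:
            continue              # static-type assumption: first assignment wins
        if NUMBER.match(value):
            types[name] = 'float' if '.' in value else 'int'
        elif value[0] in '\'"':
            types[name] = 'str'
        else:
            types[name] = 'unknown'   # expression: would need further analysis
    return types


sample = ['$i=1;', '$name="abc";', '$ratio = 0.5;', '$i="x";',
          '$t=$news_timestamp[$i];']
print(guess_types(sample))
```

The fourth sample statement reassigns $i with a string; under the static-type assumption it is ignored, which is exactly the case where such a guess can go wrong.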

An alpha version has already been implemented (pre_pythonizer.pl). The main transformation it implements is refactoring of the code: all subroutines are moved up, a main subroutine is created out of the "nesting zero" statements, and a call to it is placed at the bottom.

__DATA__ and __END__ sections are left intact and are written into a separate file by the second phase.

Note on translation phase

The main assumption (which in many cases is incorrect, but is a good starting point) is that the type of each and every variable is static and does not change during the whole execution of the program.

Some random notes and observations

Preliminary classification performed by the author, as well as the old Knuth paper (D. Knuth. An empirical study of FORTRAN programs. Software—Practice and Experience, 1:105–133, 1971; https://www.cs.tufts.edu/~nr/cs257/archive/don-knuth/empirical-fortran.pdf; the online version of the Stanford research report is available at https://apps.dtic.mil/dtic/tr/fulltext/u2/715513.pdf), suggests:

The most interesting part here is whether it is possible to match and mix those two parts into a usable product that has less than 4K lines of code.

As this is a hobby project, no timeline is provided, but I expect to create an alpha version in early 2020.

The author would appreciate comments and pointers to useful information by those who are interested in the final product.

NOTE:

Starting in 5.10, you can compile Perl with the experimental Misc Attribute Decoration enabled and set the PERL_XMLDUMP environment variable to a filename to get an XML dump of the parse tree - very helpful for language translators. This might be the best approach for "semi-automatic" conversion, as parsing was already performed. Though as the doc says, this is a work in progress. We do not use this approach.

Fully automatic conversion would require writing a Perl parser, a semantic checker, and a Python code generator. This is a major project, probably of around ten man-years in magnitude. So some corners have to be cut if we limit the project to one year.

BTW, Perl parsers are hard enough for the Perl teams to get right, and the Perl parser accepts some syntactically incorrect Perl constructs. I think the quote below sums up the theoretical side very well. From the Wikipedia article on Perl:

Perl has a Turing-complete grammar because parsing can be affected by run-time code executed during the compile phase.[25] Therefore, Perl cannot be parsed by a straight Lex/Yacc lexer/parser combination. Instead, the interpreter implements its own lexer, which coordinates with a modified GNU bison parser to resolve ambiguities in the language.

It is often said that "Only Perl can parse Perl," meaning that only the Perl interpreter (perl) can parse the Perl language (Perl), but even this is not, in general, true. Because the Perl interpreter can simulate a Turing machine during its compile phase, it would need to decide the Halting Problem in order to complete parsing in every case. It's a long-standing result that the Halting Problem is undecidable, and therefore not even Perl can always parse Perl.

Perl makes the unusual choice of giving the user access to its full programming power in its own compile phase. The cost in terms of theoretical purity is high, but practical inconvenience seems to be rare.

Other programs that undertake to parse Perl, such as source-code analyzers and auto-indenters, have to contend not only with ambiguous syntactic constructs but also with the undecidability of Perl parsing in the general case.

Adam Kennedy's PPI project focused on parsing Perl code as a document (retaining its integrity as a document), instead of parsing Perl as executable code (which not even Perl itself can always do). It was Kennedy who first conjectured that, "parsing Perl suffers from the 'Halting Problem'."[26], and this was later proved.[27]

In short you need a lot of thinking and manual work to get such a transformer to the level on which it might be useful.



Old News ;-)

[Oct 05, 2020] Version 0.8 uploaded

Changes since version 0.7

The ability to generate code for Python 2.7 and the option -p were removed.

Option -w was added, which specifies the width of the line in the protocol. The default is 188, suitable for the Fixedsys 11-point font on a 24" screen.

More correct translation of array assignments. Some non-obvious bugs in translation were fixed. Now, to run the program, you need to set the PERL5LIB variable so that it points to the directory with modules. Global variables are now initialized to the undef value after the main sub to create a global namespace. Previously this was done incorrectly. A simple installer for Python programmers who do not know much Perl was added: the program proved to be useful as a help for understanding Perl scripts by Python programmers.

Significantly more cases of using built-in functions without parentheses are translated correctly.

Changes in pre_pythonizer.pl

The current version by default does not create a main subroutine out of the statements found on nesting level zero, as that introduces some errors. You need to specify option -m to create it.

NOTE: All Python statements on nesting level zero should start from the beginning of the line, which is ugly, but you can enclose them in a dummy if statement

if True: 

to create an artificial nesting level 1
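A minimal sketch of this trick; the variable names are made up for illustration. An if statement does not open a new scope in Python, so names assigned inside the dummy block remain ordinary module-level variables.

```python
counter = 0          # a level-zero statement must start in the first column

if True:             # dummy condition: creates an artificial nesting level 1
    counter += 1     # former level-zero statements can now keep their indent
    message = f"counter={counter}"

print(message)       # names assigned inside the block are still module-level
```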

[Sep 08, 2020] Full protocol of translation of pre_pythonizer.pl by the current version of pythonizer

[Aug 31, 2020] Version 0.5 was uploaded

Regular expression and tr function translation was improved. Substr function translation was improved. Many other changes and error corrections. Option -r (refactor) was implemented to allow refactoring. By default it loads and runs pre_pythonizer.pl. As that changes the source (creating a backup), you need to run it only once.

[Aug 22, 2020] Version 0.4 was uploaded

The walrus operator and the f-strings now are used to translate Perl double quoted literals if option -p is set to 3 (default). In this version Python 3.8 is used as the target language.

[Aug 17, 2020] Version 0.3 was uploaded

Changes since version 0.2: default version of Python used is now version 3.8; option -p allows to set version 2 if you still need generation for Python 2.7 ( more constructs will be untranslatable ). See user guide for details.

[Aug 05, 2020] Version 0.2 was uploaded

[Nov 20, 2019] Version 0.070 of pythonizer was able to translate more than 80% of pre_pythonizer code correctly.

[Oct 16, 2019] pre_pythonizer.pl the version 0.1 of phase one was posted.

Produces formatted code that better suits the task of conversion to Python than a regular Perl pretty printer such as Neatperl (actually, it was derived from Neatperl).

A very simple example:

The initial code:

#!/usr/bin/perl
#
# Insertion sorting
#
    $total_chunks=scalar(@news_link_vector);
          for( $i=1; $i<@news_link_vector; $i++ ){
            next if( $news_timestamp[$i-1]>=$news_timestamp[$i]);
            $t=$news_timestamp[$i];
            $t1=$news_link_vector[$i];
            for( $j=$i-1; $j>=0 && $news_timestamp[$j]<$t; $j-- ){
               $news_timestamp[$j+1]=$news_timestamp[$j];
               $news_link_vector[$j+1]=$news_link_vector[$j];
            }
            $news_timestamp[$j+1]=$t; # move datastamp
            $news_link_vector[$j+1]=$t1; # move pointer to the record
         } # for
    
The result of formatting of insertion sort algorithm in Perl
PRE_PYTHONIZER: Phase 1 of pythonizer (last modified 191016_0106) Running at 19/10/16 01:06
Logs are at /tmp/pre_pythonizer/pre_pythonizer.191016_0106.log. Type -h for help.

================================================================================


ATTENTION!!! pre_pythonizer is working in debugging mode 1 with autocommit of source to /cygdrive/f/_Scripts/Archive
================================================================================

   0   0      | #!/usr/bin/perl
   1   0      | #
   2   0      | # Insertion sorting
   3   0      | #
   4   0      |    $total_chunks=scalar(@news_link_vector);
   5   0      |    for( $i=1; $i<@news_link_vector; $i++ )
   5   0      |    {
   6   1      |       next if( $news_timestamp[$i-1]>=$news_timestamp[$i]);
   7   1      |       $t=$news_timestamp[$i];
   8   1      |       $t1=$news_link_vector[$i];
   9   1      |       for( $j=$i-1; $j>=0 && $news_timestamp[$j]<$t; $j-- )
   9   1      |       {
  10   2      |          $news_timestamp[$j+1]=$news_timestamp[$j];
  11   2      |          $news_link_vector[$j+1]=$news_link_vector[$j];
  12   1      |       }
  12   1      |
  13   1      |       $news_timestamp[$j+1]=$t; # move datastamp
  14   1      |       $news_link_vector[$j+1]=$t1; # move pointer to the record
  15   0      |    }
  15   0      |    # for
  16   0      |
Name "main::total_chunks" used only once: possible typo at /tmp/pre_pythonizer/insertion_sort_test.pl.formatted.pl line 5.
/tmp/pre_pythonizer/insertion_sort_test.pl.formatted.pl syntax OK


CROSS REFERENCE TABLE

int $i 5, 5, 5, 6, 6, 7, 8, 9

int $j 9, 9, 9, 9, 10, 10, 11, 11

int $total_chunks 4

str $news_link_vector 8, 11, 11

str $news_timestamp 6, 6, 7, 9, 10, 10

str $t 7, 9

str $t1 8
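For comparison, here is a hand-written Python equivalent of the same insertion sort. It is a sketch of what the second phase aims at, not actual Pythonizer output; the function name sort_news and the sample data are assumptions made for illustration.

```python
def sort_news(news_timestamp, news_link_vector):
    """Insertion sort in descending timestamp order, keeping both lists in step."""
    for i in range(1, len(news_link_vector)):
        if news_timestamp[i - 1] >= news_timestamp[i]:
            continue                              # Perl: next if( ... );
        t = news_timestamp[i]
        t1 = news_link_vector[i]
        j = i - 1
        # Perl's C-style inner for loop becomes an explicit while loop
        while j >= 0 and news_timestamp[j] < t:
            news_timestamp[j + 1] = news_timestamp[j]
            news_link_vector[j + 1] = news_link_vector[j]
            j -= 1
        news_timestamp[j + 1] = t                 # move datestamp
        news_link_vector[j + 1] = t1              # move pointer to the record


stamps = [3, 1, 4, 2]          # made-up sample data
links = ["c", "a", "d", "b"]
sort_news(stamps, links)
print(stamps, links)
```

The inner loop is written as a while loop because its Perl counterpart uses the value of $j after the loop ends, which a Python for loop over a range does not preserve in the same way.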

Recommended Links


Sites

BXref - Generates cross reference reports for Perl programs - metacpan.org




Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.


Last modified: October 16, 2020