Monday 10 October 2011

PYTHON BYTECODE

In this blog post we are going to look at the internal representation of Python bytecode. When we execute a Python script, a couple of common steps happen:
1. The script is compiled and transformed into an intermediate representation called bytecode.
2. The bytecode is executed by a virtual machine.
Because of this, Python is platform independent: the bytecode is interpreted by the virtual machine rather than by the real hardware.
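
The two steps can also be tried from inside Python itself. The sketch below is only an illustration, assuming a Python 3 interpreter; the names source, code_object and namespace are made up for this example. The built-in compile() performs step 1 and exec() hands the result to the virtual machine for step 2.

```python
# Step 1: compile source text into a code object (the in-memory form of bytecode).
source = "i = 1"
code_object = compile(source, "<example>", "exec")

# The raw bytecode instructions live in the co_code attribute.
print(type(code_object.co_code), len(code_object.co_code))

# Step 2: let the virtual machine execute the compiled code.
namespace = {}
exec(code_object, namespace)
print(namespace["i"])  # the assignment has been carried out
```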

    Internals of python bytecode

Now let us dig into the internals of Python bytecode. A compiled bytecode file (.pyc) is divided into three main parts:
1. A four-byte magic number.
The first four bytes of a bytecode file are a magic number. The magic number is specific to the version of Python that produced the file.
2. A four-byte timestamp.
The timestamp field records the modification time of the script file. If the script file is modified later, the timestamp no longer matches, so the script has to be recompiled and a new bytecode file produced.
3. The marshalled code.
The rest of the file is the marshalled code: the Python bytecode instructions, along with their operands, are encoded sequentially in this region.
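
The three-part layout can be sketched in code. This is a minimal illustration, assuming the 8-byte header described above (4-byte magic number, 4-byte timestamp); newer Python 3 releases write a longer header, so the offsets would differ for files they produce. The function name split_pyc and the sample magic and timestamp values are made up for the example, and the sample bytes are built in memory rather than read from a real file.

```python
import marshal
import struct

def split_pyc(data):
    """Split .pyc bytes into (magic, timestamp, payload), assuming an 8-byte header."""
    magic = data[:4]                               # part 1: magic number
    timestamp = struct.unpack("<I", data[4:8])[0]  # part 2: little-endian timestamp
    payload = data[8:]                             # part 3: marshalled code object
    return magic, timestamp, payload

# Build sample bytes with the same layout (hypothetical magic and timestamp).
sample_code = compile("i = 1", "test.py", "exec")
sample = b"\x01\x02\r\n" + struct.pack("<I", 1234567890) + marshal.dumps(sample_code)

magic, timestamp, payload = split_pyc(sample)
restored = marshal.loads(payload)  # unmarshal the payload back into a code object
print(magic, timestamp, restored.co_filename)
```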

    Manually compiling a python script.

To compile Python code manually, you can use the py_compile module.
For example, create a Python script:

$vim test.py
i = 1


Now, to compile test.py, you can use the script given below.
import sys, py_compile

def main(argv):
  # argv[1] is the path of the script to compile
  py_compile.compile(argv[1])

if __name__ == '__main__':
  main(sys.argv)
Let us call the above script compile.py.
$python compile.py test.py
This will produce a file named test.pyc next to test.py, which is the current directory if you ran the command as shown.

If you want to see what is inside test.pyc, you can use the Linux command od.
$od -tx1 test.pyc
0000000 d1 f2 0d 0a fb 73 8e 4e 63 00 00 00 00 00 00 00
0000020 00 01 00 00 00 40 00 00 00 73 0a 00 00 00 64 00
0000040 00 5a 00 00 64 01 00 53 28 02 00 00 00 69 01 00
0000060 00 00 4e 28 01 00 00 00 74 01 00 00 00 69 28 00 
0000100 00 00 00 28 00 00 00 00 28 00 00 00 00 73 07 00
0000120 00 00 74 65 73 74 2e 70 79 74 08 00 00 00 3c 6d
0000140 6f 64 75 6c 65 3e 01 00 00 00 73 00 00 00 00
0000157
The above is the hex dump of my test.pyc file. As you can see:
the first four bytes (d1 f2 0d 0a) are the magic number.
the next four bytes (fb 73 8e 4e) are the timestamp.
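
To read these two fields as numbers, the first line of the dump can be decoded by hand. The sketch below is an illustration only, assuming Python 3 and the little-endian byte order used in the file; the variable names are made up for the example.

```python
import struct
from datetime import datetime, timezone

# First line of the od output above, copied as text.
od_line = "0000000 d1 f2 0d 0a fb 73 8e 4e 63 00 00 00 00 00 00 00"

# Drop the offset column and turn the hex pairs back into bytes.
raw = bytes(int(pair, 16) for pair in od_line.split()[1:])

# Magic: a 2-byte little-endian number followed by "\r\n".
magic_number = struct.unpack("<H", raw[:2])[0]
# Timestamp: 4-byte little-endian count of seconds since 1970-01-01 UTC.
timestamp = struct.unpack("<I", raw[4:8])[0]
modified = datetime.fromtimestamp(timestamp, tz=timezone.utc)

print(magic_number, raw[2:4], timestamp, modified.isoformat())
```

For this dump the magic number comes out as 62161 and the timestamp as 1317958651, which is 7 October 2011 in UTC.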

To understand the rest of the file, we need to know the opcodes of the Python bytecode instructions. You can see all the Python bytecode instructions and their opcodes
here
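
The opcode numbers can also be looked up from Python through the dis module. The sketch below is a small illustration, assuming Python 3; the numbers it prints belong to whichever interpreter runs it and may not match the ones in the hex dump above, because opcode values change between Python versions.

```python
import dis

# dis.opmap maps instruction names to opcode numbers for the running interpreter.
for name in ("LOAD_CONST", "STORE_NAME", "RETURN_VALUE"):
    opcode = dis.opmap[name]
    # dis.opname is the reverse table, indexed by opcode number.
    print(name, opcode, hex(opcode), dis.opname[opcode])
```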


    Disassembling the python bytecodes

To disassemble Python code, you can use the dis module.
$python -m dis test.py

0 LOAD_CONST 0 (1)
3 STORE_NAME 0 (i)
6 LOAD_CONST 1 (None)
9 RETURN_VALUE
This is the internal representation of the code i = 1 in Python bytecode.
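
The same kind of listing can be produced from a script instead of the command line. This is a minimal sketch, assuming Python 3.4 or newer, where dis.get_instructions is available; the variable names are made up for the example. The exact instructions and offsets depend on the interpreter version, so the output will usually not be identical to the listing above.

```python
import dis

# Compile the one-line script in memory rather than reading test.py from disk.
code_object = compile("i = 1", "test.py", "exec")

# get_instructions yields one record per bytecode instruction.
instructions = list(dis.get_instructions(code_object))
for instruction in instructions:
    print(instruction.offset, instruction.opname, instruction.argrepr)
```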
If you want to see the full disassembly of the bytecode file test.pyc, you can use the script given below.

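import sys, dis, marshal

def main(argv):
  # open in binary mode; a .pyc file is not text
  fp = open(argv[1], 'rb')
  fp.read(8)                      # skip the magic number and the timestamp
  code_object = marshal.load(fp)  # the rest is the marshalled code
  fp.close()
  dis.dis(code_object)

if __name__ == '__main__':
  if len(sys.argv) < 2:
    print('usage: python pydis.py <file.pyc>')
  else:
    main(sys.argv)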

    Writing your own disassembler

Now, with the details given above, if you are interested, try to write a disassembler yourself. You can also check this code:
https://github.com/tonylijo/python-byte-code-disassembler
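
As a starting point, here is a minimal sketch of the decoding loop. It is not the code from the repository above. It assumes the Python 2 instruction format of the test.pyc shown in this post, where opcodes numbered 90 and above are followed by a two-byte argument, and it only knows the three opcodes that occur in that file. The sketch itself is written to run under Python 3, and the names OPNAMES and disassemble are made up for the example.

```python
# Small opcode table for the Python 2 instructions that appear in test.pyc.
# A complete disassembler would need the full opcode table.
OPNAMES = {0x53: "RETURN_VALUE", 0x5A: "STORE_NAME", 0x64: "LOAD_CONST"}
HAVE_ARGUMENT = 90  # in Python 2, opcodes >= 90 carry a 2-byte argument

def disassemble(code):
    """Return (offset, name, argument) tuples for a Python 2 bytecode string."""
    result = []
    offset = 0
    while offset < len(code):
        opcode = code[offset]
        name = OPNAMES.get(opcode, "<%d>" % opcode)
        if opcode >= HAVE_ARGUMENT:
            # The argument is stored little-endian in the next two bytes.
            argument = code[offset + 1] | (code[offset + 2] << 8)
            result.append((offset, name, argument))
            offset += 3
        else:
            result.append((offset, name, None))
            offset += 1
    return result

# The 10 code bytes from the hex dump of test.pyc (they follow "73 0a 00 00 00").
code_bytes = bytes.fromhex("64 00 00 5a 00 00 64 01 00 53")

for offset, name, argument in disassemble(code_bytes):
    print(offset, name, "" if argument is None else argument)
```

The offsets, names and arguments printed by this loop line up with the python -m dis listing shown earlier for i = 1.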