Postgresql kettastruktuurist

Allikas: Lambda

Enar Reilenti koostatud ülevaade

klaster - ühe serveri instance'i poolt hallatav failide kogum

ühes masinas võib olla mitu klastrit ja serverit 

igal klastril oma conf (postgresql.conf, pg_hba.conf, pg_ident.conf)

klastri juur on tavaliselt dir data (/var/lib/pgsql/data) - PGDATA keskkonnamuutuja

Data dir'i sees:
Item                             Description
---------------------------------------------------------
* PG_VERSION	   A file containing the major version number of PostgreSQL
* base		   Subdirectory containing per-database subdirectories
* global	   Subdirectory containing cluster-wide tables, such as pg_database
* pg_clog  	   Subdirectory containing transaction commit status data
* pg_multixact	   Subdirectory containing multitransaction status data (used for shared row locks)
* pg_subtrans	   Subdirectory containing subtransaction status data
* pg_tblspc  	   Subdirectory containing symbolic links to tablespaces
* pg_twophase  	   Subdirectory containing state files for prepared transactions
* pg_xlog 	   Subdirectory containing WAL (Write Ahead Log) files
* postmaster.opts  A file recording the command-line options the server was last started with
* postmaster.pid   A lock file recording the current server PID and shared memory segment ID (not present after server shutdown

iga (kasutaja) baasi kohta on data/base/ dir'is subdir, mille nimi on andmebaasi nimele vastav ID (OID)
klastri globaalsest tabelist pg_database

näide:
===============================================================================
perf=# SELECT dаtnаme, oid FROM pg_dаtаbаse;

  dаtnаme  |  oid

-----------+-------

 perf      | 16556

 templаte1 |     1

 templаte0 | 16555

--------------------------------------------------------------------

$ cd ./bаse

$ ls -l

totаl 12

drwx------    2 postgres pgаdmin      4096 Jаn 01 20:53 1

drwx------    2 postgres pgаdmin      4096 Jаn 01 20:53 16555

drwx------    3 postgres pgаdmin      4096 Jаn 01 22:38 16556

===============================================================================

iga tabeli ja indeksi jaoks on OID nimega dir'is oma fail - nimetatakse 'filenode'

filenod'i nimi tuleb tabelist: pg_class.relfilenode

enamasti langeb relfilenode kokku OID'iga, aga, ajutisel tabelil ei ole relfilenode'i,
kirje pg_class tabelis on ja OID ka on

TRUNCATE, REINDEX, CLUSTER ja ALTER TABLE võivad muuta filenode'i, OID ei muutu


näiteks:
===============================================================================
templаte1=# SELECT oid, relnаme FROM pg_clаss ORDER BY oid;

  oid  |             relnаme

-------+---------------------------------
  1247 | pg_type

  1249 | pg_аttribute

  1255 | pg_proc

  1259 | pg_clаss

  1260 | pg_shаdow

  1261 | pg_group

  1262 | pg_dаtаbаse

 16384 | pg_аttrdef
===============================================================================

kui tabeli või indeksi faili suurus läheb üle 1GB, siis tehakse segmendid:
esimene segment on algne filenode, teine filenoad.1, kolmas filenode.2 jne

iga tabeli või indeksi fail jaguneb page'ideks

näide:
===============================================================================
perf=# SELECT relnаme, oid, relpаges, reltuples FROM pg_clаss

   relnаme    | oid  | reltuples | relpаges

--------------+------+-----------+----------

 pg_type      | 1247 |       143 |        2

 pg_аttribute | 1249 |       795 |       11

 pg_proc      | 1255 |      1263 |       31

 pg_clаss     | 1259 |       101 |        2

-------------------------------------------------------------

$ ls -l 1247 1249

-rw-------    1 postgres pgаdmin     16384 Jаn 01 20:53 1247

-rw-------    1 postgres pgаdmin     90112 Jаn 01 20:53 1249


 16384/2 = 8192
 90122/11 = 8192
 eаch pаge is 8192 bytes - saab muuta serveri kompileerimise käigus

===============================================================================

veel üks näide:
===============================================================================

robots=# select relname, oid,relfilenode, relpages,reltuples from pg_class order by reltuples;
                              relname          |  oid  | relfilenode | relpages | reltuples
 
...
 strings                                       | 16389 |       16389 |        4 |       560
 strings_string_key                            | 16395 |       16395 |        6 |       560
 pg_operator                                   |  2617 |        2617 |       13 |       702
 pg_operator_oprname_l_r_n_index               |  2689 |        2689 |       11 |       702
 pg_operator_oid_index                         |  2688 |        2688 |        4 |       702
 pg_description_o_c_o_index                    |  2675 |        2675 |       12 |      1988
 pg_description                                |  2609 |        2609 |       16 |      1988
 pg_proc_oid_index                             |  2690 |        2690 |        7 |      2143
 pg_proc                                       |  1255 |        1255 |       50 |      2143
 pg_proc_proname_args_nsp_index                |  2691 |        2691 |       40 |      2143
 pg_attribute_relid_attnam_index               |  2658 |        2658 |       28 |      2255
 pg_attribute_relid_attnum_index               |  2659 |        2659 |        9 |      2255
 pg_attribute                                  |  1249 |        1249 |       38 |      2255
 pg_depend_depender_index                      |  2673 |        2673 |       26 |      5267
 pg_depend                                     |  2608 |        2608 |       37 |      5267
 pg_depend_reference_index                     |  2674 |        2674 |       28 |      5267
 robodata_property_idx                         | 16441 |       16441 |     2256 |    786039
 robodata_addtime_idx                          | 16442 |       16442 |     2167 |    786039
 robodata_date_idx                             | 16443 |       16443 |     2821 |    786039
 robodata_pkey                                 | 16403 |       16403 |     1728 |    786039
 robodata_subject_idx                          | 16440 |       16440 |     2258 |    786039
 idx                                           | 24782 |       24782 |     3034 |    787235
 robodata_sender_receiver_property_subject_idx | 24787 |       24787 |     3034 |    787241
 robodata                                      | 16397 |       16397 |    10792 |    787241
...


robots=# \d robodata
                                     Table "public.robodata"
  Column   |            Type             |                       Modifiers
-----------+-----------------------------+-------------------------------------------------------
 id        | integer                     | not null default nextval('robobase_id_seq'::regclass)
 sender    | integer                     | not null
 receiver  | integer                     |
 addtime   | timestamp without time zone | not null default now()
 status    | character(1)                | not null default 'A'::bpchar
 subject   | integer                     | not null
 property  | integer                     | not null
 value     | character varying(256)      | not null
 valuetype | integer                     | not null
 rid       | integer                     |
 date      | timestamp without time zone |
 usectime  | integer                     |
 datestamp | timestamp without time zone |
 usecstamp | integer                     |
 source    | integer                     |
 context   | integer                     |
Indexes:
    "robodata_pkey" PRIMARY KEY, btree (id)
    "idx" btree (subject, property, addtime)
    "robodata_addtime_idx" btree (addtime)
    "robodata_date_idx" btree (date)
    "robodata_property_idx" btree (property)
    "robodata_sender_receiver_property_subject_idx" btree (sender, receiver, property, subject)
    "robodata_subject_idx" btree (subject)
Foreign-key constraints:
    "robodata_context_fkey" FOREIGN KEY (context) REFERENCES strings(id)
    "robodata_property_fkey" FOREIGN KEY (property) REFERENCES strings(id)
    "robodata_receiver_fkey" FOREIGN KEY (receiver) REFERENCES strings(id)
    "robodata_sender_fkey" FOREIGN KEY (sender) REFERENCES strings(id)
    "robodata_source_fkey" FOREIGN KEY (source) REFERENCES strings(id)
    "robodata_subject_fkey" FOREIGN KEY (subject) REFERENCES strings(id)
    "robodata_valuetype_fkey" FOREIGN KEY (valuetype) REFERENCES strings(id)


postgres@eliko4:~/8.3/main/base/16384$ ls -l 16397
-rw------- 1 postgres postgres 88408064 2009-04-14 09:34 16397


88408064 / 10792 = 8192 bati iga page

88408064 / 787241 = 112.3 baiti iga rea kohta keskmiselt
===============================================================================



page'il on 5 osa:

Item         Description
* PageHeaderData   24 bytes long. Contains general information about the page, including free space pointers.
* ItemIdData       Array of (offset,length) pairs pointing to the actual items. 4 bytes per item.
* Free space       The unallocated space. New item pointers are allocated from the start of this area, new items from the end.
* Items            The actual items themselves.
* Special space    Index access method specific data. Different methods store different data. Empty in ordinary tables.


Items osa 'item' on data tabeli puhul row, index'i puhul node (entry)

PageHeaderData hoiab pointereid (offset) Free space'i algusesse ja lõppu, Special space'ile;
flage;
viimase muutuse xlog andmeid (XLogRecPtr, TimeLineID);
versiooni !:
  PostgreSQL 8.3 alates 4
  PostgreSQL 8.1, 8.2   3
  PostgreSQL 8.0        2
  PostgreSQL 7.3, 7.4   1
  varasemad             0

ItemIdData's on viidad itemitele, iga viit 4 baiti: 2 offset page'i algusest + 2 item'i pikkus.

itemite hulka saab PageHeaderData'st välja rehkendada.

kui item'it page'i sees ümber paigutatakse, siis viida asukoht (array's) jääb samaks!

seega mistahes viidad item'ile (ItemPointer, also known as CTID) esitatakse andmebaasis nii:
viita pageile ja item'i indeksit array's

Special space kas puudub või hoitakse seal access meetodite spetsiifilist datat,
näiteks b-puu indeksi vasaku ja parema naaber-page'i (sibling) viidad




item jaguneb kaheks: muutuva pikkusega header ja data osa

headeris on kindlasti:
  Field       Length   Description
  t_xmin      4 bytes  insert XID stamp
  t_xmax      4 bytes  delete XID stamp
  t_cid       4 bytes  insert and/or delete CID stamp (overlays with t_xvac)
  t_xvac      4 bytes  XID for VACUUM operation moving a row version
  t_ctid      6 bytes  current TID of this or newer row version
  t_infomask2 2 bytes  number of attributes, plus various flag bits
  t_infomask  2 bytes  various flag bits
  t_hoff      1 byte   offset to user data

t_hoff must always be a multiple of the MAXALIGN distance for the platform

lisaks optional headeri osad: null bitmap ja object id

kui aligne'i pärast on vaja jätta tühja ruumi, jäetakse see null bitmapi ja object id vahele

data osa interpreteerimiseks on vaja teiste (süsteemsete) tabelite abi:

tulpade väärtused on järjest ühes baidijadas

pg_attribute tabelist veerud attlen ja attalign määravad välja väärtuse asukoha baidijadas


näide:
=====================================================================
robots=# select count(*) from pg_attribute;
 count                                     
-------                                    
  2262                                     
(1 row)                                    

                                                   
robots=# select * from pg_attribute where attname = 'subject' limit 7;

 attrelid | attname | atttypid | attstattarget | attlen | attnum | attndims | attcacheoff | atttypmod | attbyval | attstorage | attalign | attnotnull | atthasdef | attisdropped | attislocal | attinhcount
----------+---------+----------+---------------+--------+--------+----------+-------------+-----------+----------+------------+----------+------------+-----------+--------------+------------+-------------
    16397 | subject |       23 |            -1 |      4 |      6 |        0 |          -1 |        -1 | t        | p          | i        | t          | f        | f            | t          |           0
    16440 | subject |       23 |            -1 |      4 |      1 |        0 |          -1 |        -1 | t        | p          | i        | f          | f        | f            | t          |           0
    24782 | subject |       23 |            -1 |      4 |      1 |        0 |          -1 |        -1 | t        | p          | i        | f          | f        | f            | t          |           0
    24787 | subject |       23 |            -1 |      4 |      4 |        0 |          -1 |        -1 | t        | p          | i        | f          | f        | f            | t          |           0
(4 rows)

robots=#
robots=# select * from pg_attribute where attname = 'receiver' limit 7;
 attrelid | attname  | atttypid | attstattarget | attlen | attnum | attndims | attcacheoff | atttypmod | attbyval | attstorage | attalign | attnotnull | atthasdef | attisdropped | attislocal | attinhcount
----------+----------+----------+---------------+--------+--------+----------+-------------+-----------+----------+------------+----------+------------+-----------+--------------+------------+-------------
    16397 | receiver |       23 |            -1 |      4 |      3 |        0 |          -1 |        -1 | t        | p          | i        | f          | f         | f            | t          |           0
    24787 | receiver |       23 |            -1 |      4 |      2 |        0 |          -1 |        -1 | t        | p          | i        | f          | f         | f            | t          |           0
(2 rows)

=====================================================================

kõige parem olukord on siis, kui kõik väljad on fixed pikkusega ja null'e ei ole


kui välja andmetüüp on muutuva pikkusega (attrlen=-1) jaguneb väli omakorda dataks ja päiseks

päises on öeldud data pikkus ja TOAST info


TOAST - The Oversized-Attribute Storage Technique or "the best thing since sliced bread"

kuna ei lubata data row jagamist mitmesse page'i, siis tuleb suurte väärtuste puhul pakkida

või jagada row osadeks ja panna teise tabelisse.

TOAST'i saab teha ainult muutuva pikkusega väljadel (varlena)

flag bit'id varlena välja päises näitavad, kas tegu on:
tavalise dataga otse väljas
pakitud dataga otse väljas
viidaga (+muud andmed) TOAST'itud datale teises tabelis

kui kasutaja tabelis on võimalikud suured väärtused, siis tehakse automaatselt
selle tabeli TOAST tabel, mille oid kirjutatakse  pg_class.reltoastrelid välja


suured väärtused jagatakse tükkideks - 2000 baiti tavaliselt - 4 tükki mahub 1 page'i

iga tükk läheb row'ks TOAST tabelis

chunk_id ja chunk_seq väljad seovad (loogiliselt) tükid kokku

päris tabeli row väärtuse kohal on TOAST tabeli oid ja chunk_id (+suurused)

seda, kui suur peab välja väärtus olema, et TOAST rakenduks, saab konfida 

tabeli tulpadel võib olla 4 erinevat TOAST strateegiat:
PLAIN    -  ei pakkimist, ei TOAST'i
EXTENDED -  algul üritatakse pakkida, kui ei piisa, siis TOAST'ida
EXTERNAL -  pakkimist ei ole lubatud (tekstiotsinguks)
MAIN     -  pakkimine on lubatud, TOAST'i kasutatakse viimases hädas, kui muul viisil ei saa rida väikseks teha

ALTER TABLE SET STORAGE - TOAST strateegia muutmiseks