ViSolve - Squid Configuration Manual

GLOSSARY

parent

In a parent relationship, the child cache will forward requests to its parent cache. If the parent does not hold a requested object, it will forward the request on behalf of the child. A cache hierarchy should closely follow the underlying network topology. Parent caches should be located along the network paths towards the greater Internet. For example, if your Internet Service Provider (ISP) operates a cache, it should probably be a parent to yours, since your Web traffic will have to travel along your ISP's infrastructure anyway.

sibling

In a sibling relationship, a peer may only request objects already held in the cache; a sibling cannot forward cache misses on behalf of the peer. The sibling relationship should be used for caches ``nearby'' but not in the direction of your route to the Internet. For example, it may make sense for a number of department-specific caches within an organization to have sibling relationships among them. This approach is even more compelling when there is no parent cache available for the organization as a whole.

Multicast and Unicast

A unicast packet is sent from one machine to exactly one other machine. All TCP connections are unicast, since they can only have one destination host for each source host. UDP packets are almost always unicast too, though in some cases they can be sent to the broadcast address so that they reach every machine on the network.

A multicast packet is sent from one machine to one or more machines. The difference between a multicast packet and a broadcast packet is that hosts receiving multicast packets can be on different LANs, and that each multicast data stream is transmitted between networks only once, not once per machine on the remote network. Rather than each machine connecting to a video server, the multicast data is streamed per network, and multiple machines just listen in on the multicast data once it is on the network.

Netmask

An IP address has two components, the network address and the host address. For example, consider the IP address 172.16.1.25. Assuming this is part of a Class B network, the first two numbers (172.16) represent the Class B network address, and the second two numbers (1.25) identify a particular host on this network.

Subnetting enables the network administrator to further divide the host part of the address into two or more subnets. In this case, a part of the host address is reserved to identify the particular subnet. This is easier to see if we show the IP address in binary format. The full address is:

10101100.00010000.00000001.00011001
The Class B network part is:
10101100.00010000
and the host address is:
00000001.00011001
Suppose the subnet mask for this IP address is 255.255.255.0, or 11111111.11111111.11111111.00000000 in binary.
The subnet address is obtained by a bitwise AND of the IP address and the subnet mask.
With this mask the Class B network is divided into up to 256 subnets, because the first 8 bits of the host address (here 00000001) are reserved for identifying the subnet. The result of the AND is:
10101100.00010000.00000001.00000000
that is, 172.16.1.0. This subnet covers the addresses 172.16.1.0 to 172.16.1.255, of which 172.16.1.1 to 172.16.1.254 can be assigned to hosts (172.16.1.255 is the broadcast address).
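
The bitwise AND described above can be checked with a short script. This is only a sketch using Python's standard ipaddress module; the helper name dotted_binary is made up for this illustration.

```python
import ipaddress

# Address and mask taken from the example in the text.
address = ipaddress.IPv4Address("172.16.1.25")
netmask = ipaddress.IPv4Address("255.255.255.0")


def dotted_binary(ip):
    """Render an IPv4 address as four dot-separated 8-bit groups."""
    return ".".join(f"{octet:08b}" for octet in ip.packed)


# The subnet address is the bitwise AND of the address and the mask.
subnet = ipaddress.IPv4Address(int(address) & int(netmask))

print("address:", dotted_binary(address))
print("netmask:", dotted_binary(netmask))
print("subnet: ", dotted_binary(subnet), "=", subnet)

# hosts() leaves out the network address and the broadcast address.
network = ipaddress.ip_network(f"{subnet}/{netmask}")
hosts = list(network.hosts())
print("hosts:  ", hosts[0], "to", hosts[-1])
```

Changing the mask to 255.255.0.0 gives the plain Class B network 172.16.0.0, which shows how the mask alone decides where the subnet part ends.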

FileSystems in Squid

The cache_dir type in Squid has nothing to do with the underlying filesystem type; it defines the storage method (implementation).

Currently Squid has 4 different implementations:
ufs :- On top of a normal filesystem supporting directories and files.
aufs :- As "ufs", but using threads to implement non-blocking disk I/O.
diskd :- As "ufs", but using a separate process to implement non-blocking disk I/O.
coss :- An experimental "raw" filesystem, where all objects are stored in one big file.
Other storage methods are being worked on.

diskd is designed to work around the problem of blocking I/O in a Unix process. Async ufs (aufs) gets around this by using threads to complete disk I/O; diskd uses external processes to complete disk I/O.

aufs works just that little bit faster, but only works on systems where threads can do asynchronous disk I/O without blocking the main process. Systems with user threads (e.g. FreeBSD) cannot use it effectively. diskd, being implemented as an external process, gets around this. If the cache is only lightly loaded, the difference cannot be noticed; diskd and aufs are only useful when the cache is under high load.

In case it was not clear, asynchronous I/O (diskd/aufs) is beneficial for single-drive configurations with "higher" request loads, in many cases allowing you to push about 100% more I/O through the drive before latency creeps up too high.

For multiple-drive configurations, asynchronous I/O is almost a requirement for using the I/O capacity of the extra drives. Without it, a multiple-disk configuration is effectively limited to almost the speed of a single-disk configuration. With asynchronous I/O, the disk I/O scales quite well (at least for the first few drives; other limits become very apparent when you have more than ~3 drives).

Cache_peer Options

proxy-only

Data retrieved from this remote cache will not be stored locally, but retrieved again on any subsequent request. By default, Squid will store objects it retrieves from other caches: by having the object available locally it can return the object fast if it is requested again. While this is good for latency, it can be wasteful, especially if the other cache is on the same piece of Ethernet: when the caches are close to each other (e.g. on the same Ethernet segment), it costs relatively little to transfer an object from one to the other. This option is therefore often useful in a cluster of sibling caches, to prevent each cache from holding every object. In the examples section of this chapter, we use this option when load-balancing between two cache servers.

weight

If more than one cache server has an object (based on the results of the ICP queries), Squid fetches the data from the cache that responded fastest. If you want to prefer one cache over another, you can add a weight value to the preferred cache's config line. Larger values are preferred. Squid times how long each ICP request takes (in milliseconds) and divides the time by the weight value, using the cache with the smallest result. A higher weight thus artificially lowers the calculated RTT of a peer, favoring it in the selection algorithm. For this reason, the weight should not be set to an unreasonably large value.
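
The division described above can be illustrated with a few lines of Python. This is a sketch of the arithmetic only, not Squid's actual code; the peer names, timings and the function name pick_peer are invented for the example.

```python
def pick_peer(replies):
    """Return the peer whose RTT divided by weight is smallest.

    replies maps a peer name to a (rtt_ms, weight) pair.
    """
    return min(replies, key=lambda name: replies[name][0] / replies[name][1])


# Hypothetical ICP reply times in milliseconds, with configured weights.
replies = {
    "peer-a.example": (30.0, 1),  # score 30 / 1 = 30
    "peer-b.example": (50.0, 2),  # score 50 / 2 = 25
}

unweighted = {name: (rtt, 1) for name, (rtt, _) in replies.items()}

print("with weights:   ", pick_peer(replies))
print("without weights:", pick_peer(unweighted))
```

With the weights applied, the slower peer-b.example wins because its score (25) is below that of peer-a.example (30); with equal weights the faster peer is chosen.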

ttl

An outgoing multicast packet has a ttl (Time To Live) value, which is used to ensure that loops are not created. Each time a packet passes through a router, the router decrements this ttl value, and the value is then checked. Once the value reaches zero, the packet is dropped. If you want multicast packets to stay on your local network, you would set the ttl value to 1. The first router to see the packet would decrement the value, discover that the ttl was zero and discard the packet. This value gives you a level of control over how many multicast routers will see the packet. You should set this value carefully, so that you limit packets to your local network or immediate multicast peers. (Larger multicast groups are seldom of any use: they generate too many responses and, when geographically dispersed, may simply add latency. You also don't want crackers picking up all your ICP requests by joining the appropriate multicast group.)

no-query

Squid will send ICP requests to all configured caches. The response time is measured, and used to decide which parent to send the HTTP request to. There is another function of these requests: if there is no response to a request, the cache is marked down. If you are communicating with a cache that does not support ICP, you must use the no-query option: if you don't, Squid will consider that cache down, and attempt to go directly to the destination server. (If you want, you can set the ICP port on the config line to point to the echo port, port 7. Squid will then use this port to check if the machine is available. Note that you will have to configure inetd.conf to support the UDP echo port.) This option is normally used in conjunction with the default option and round-robin option.

cache_peer proxy.visolve.com1 parent 3128 3130 no-query default

default

This sets the host to be the proxy of last resort. If no other cache matches a rule (due to acl or domain filtering), this cache is used. If you have only one way of reaching the outside world, and it doesn't support ICP, you can use the default and no-query options to ensure that all queries are passed through it. If this cache is then down, the client will see an error message (without these options, Squid would attempt to route around the problem).

round-robin

This option must be used on more than one cache_peer line to be useful. Connections to caches configured with this option are spread evenly (round-robined) among the caches. This can be used by client caches to communicate with a group of loaded parents, so that load is spread evenly. If you have multiple Internet connections, with a parent cache on each side, you can use this option to do some basic load-balancing of the connections.

In other words, the round-robin option is similar to default, except that Squid forwards the request to the parent with the lowest use count. The cache_peer_domain restrictions still apply, of course. A typical configuration might look like:

cache_peer proxy.visolve.com1 parent 3128 3130 round-robin no-query
cache_peer proxy.visolve.com2 parent 3128 3130 round-robin no-query
cache_peer proxy.visolve.com3 parent 3128 3130 round-robin no-query

Squid treats all round-robin parents equally. It is not currently possible to, e.g., forward 25% of the requests to one parent and 75% to another.
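
The "lowest use count" rule can be sketched as follows. This is an illustration of the idea, not Squid's implementation; the function name and the way ties are broken (first parent listed) are assumptions of the sketch.

```python
def pick_round_robin(use_counts):
    """Pick the parent with the lowest use count and record the use.

    Ties go to the parent listed first (an assumption of this sketch).
    """
    parent = min(use_counts, key=use_counts.get)
    use_counts[parent] += 1
    return parent


# The three parents from the configuration above, all unused so far.
use_counts = {
    "proxy.visolve.com1": 0,
    "proxy.visolve.com2": 0,
    "proxy.visolve.com3": 0,
}

choices = [pick_round_robin(use_counts) for _ in range(6)]
for choice in choices:
    print(choice)
print(use_counts)
```

After six requests each parent has been used twice, which is the even spread the option is meant to give.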

no-netdb-exchange

If your cache was configured to keep ICMP (ping) timing information with the --enable-icmp configure option, your cache will attempt to retrieve the remote machine's ICMP timing information from any peers. If you don't want this to happen (or the remote cache doesn't support it), you can use the no-netdb-exchange option to stop Squid from requesting this information from the cache.

no-delay

Hits from other caches will normally be included into a client's delay-pool information. If you have two caches load-balancing, you don't want the hits from the other cache to be limited. You may also want hits from caches in a nearby hierarchy to come down at full speed, not to be limited as if they were misses. Use the no-delay option to ensure that requests come down at their full speed.

login

Caches can be configured to require usernames and passwords for access. To authenticate with a parent cache, you can enter a username and password using this tag. Note that the HTTP protocol makes authenticating to multiple cache servers impossible: you cannot chain together a string of proxies, each one requiring authentication. You should only use this option if this is a personal proxy.

Round Trip Time

The round trip time is the interval between the sending of the first byte of an HTTP request and the arrival of the last byte of the server's response at the requesting web client.

Probe

Squid will wait for up to dead_peer_timeout seconds after sending out an ICP request before deciding to ignore a peer. With a multicast group, peers can leave and join at will, and it should make no difference to a client. This presents a problem for Squid: it cannot wait for a number of seconds each time. (What if the caches are on the same network and responses come back in milliseconds? The waiting just adds latency.) Squid gets around this problem by occasionally sending ICP probes to the multicast address. Each host in the group responds to the probe, so Squid knows how many machines are currently in the group. When sending a real request, Squid waits until it gets at least as many responses as were returned in the last probe; if more arrive, great. If fewer arrive, though, Squid waits until the dead_peer_timeout value is reached. If there is still no reply, Squid marks that peer as down, so that all connections are not held up by one peer.

What is the httpd-accelerator mode?

An accelerator caches incoming requests for outgoing data (i.e., that which you publish to the world). It takes load away from your HTTP server and internal network. You move the server away from port 80 (or whatever your published port is), and substitute the accelerator, which then pulls the HTTP data from the ``real'' HTTP server (only the accelerator needs to know where the real server is). The outside world sees no difference (apart from an increase in speed, with luck).

The httpd_accel_uses_host_header Option

A normal HTTP request consists of three values: the type of transfer (normally a GET, which is used for downloads); the path and filename to be retrieved (or executed, in the case of a CGI program); and the HTTP version.

This layout is fine if you only have one web site on a machine. On systems where you have more than one site, though, it makes life difficult: the request does not contain enough information, since it doesn't include information about the destination domain. Most operating systems allow you to have IP aliases, where you have more than one IP address per network card. By allocating one IP per hosted site, you could run one web server per IP address. Once the programs were made more efficient, one running program could act as a server for many sites: the only requirement was that you had one IP address per domain. Server programs would find out which of the IP addresses clients were connected to, and would serve data from different directories for each IP.

There are a limited number of IP addresses, and they are fast running out. Some systems also have a limited number of IP aliases, which means that you cannot host more than a (fairly arbitrary) number of web sites on one machine. If the client were to pass the destination host name along with the path and filename, the web server could listen on only one IP address, and would find the right destination directories by looking in a simple hostname table.

From version 1.1 on, the HTTP standard supports a special Host header, which is passed along with every outgoing request. This header also makes transparent caching and acceleration easier: by pulling the host value out of the headers, Squid can translate a standard HTTP request to a cache-specific HTTP request, which can then be handled by the standard Squid code. Turning on the httpd_accel_uses_host_header option enables this translation. You will need to use this option when doing transparent caching.

It's important to note that acls are checked before this translation. You must combine this option with strict source-address checks, so you cannot use this option to accelerate multiple backend servers (this is certain to change in a later version of Squid).

The always_direct and never_direct tags

Squid checks all always_direct tags before it checks any never_direct tags. If a matching always_direct tag is found, Squid will not check the never_direct tags, but decides immediately which cache to talk to. This behavior is demonstrated by the following example: here, Squid will attempt to go directly to the machine intranet.mydomain.example, even though the same host is also matched by the all acl.

Bypassing a parent for a local machine

cache_peer proxy.visolve.com parent 3128 3130
acl all src 0.0.0.0/0.0.0.0
acl localmachines dstdomain intranet.mydomain.example
never_direct allow all
always_direct allow localmachines

Let's consider a request destined for the web server intranet.mydomain.example. Squid first works through all the always_direct lines; the request is matched by the first (and only) line. The never_direct and always_direct tags are acl-operators, which means that the first match is the one that counts. In this illustration, the matching line instructs Squid to go directly when the acl matches, so all neighboring peers are ignored for this request. If the line used the deny keyword instead of allow, Squid would simply have skipped on to checking the never_direct lines.

Now suppose a request arrives for an external host. Squid works through the always_direct lines, and finds that none of them match. The never_direct lines are then checked. The all acl matches the connection, so Squid marks the connection as never to be forwarded directly to the origin server.
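
The order of evaluation walked through above can be modelled in a few lines of Python. This is a simplified sketch, not Squid's code: acls are reduced to plain functions on a hostname, and the names first_match, routing and the result strings are invented for the illustration.

```python
def first_match(rules, host):
    """Return 'allow' or 'deny' from the first rule whose acl matches, else None."""
    for action, acl in rules:
        if acl(host):
            return action
    return None


def routing(host, always_direct, never_direct):
    """Check always_direct first; only then look at never_direct."""
    if first_match(always_direct, host) == "allow":
        return "direct"
    if first_match(never_direct, host) == "allow":
        return "peers-only"
    return "undecided"  # other selection rules would apply (not modelled here)


# Simplified stand-ins for the two acls in the example configuration.
def acl_all(host):
    return True


def acl_localmachines(host):
    return host == "intranet.mydomain.example"


always_direct = [("allow", acl_localmachines)]
never_direct = [("allow", acl_all)]

for host in ("intranet.mydomain.example", "www.example.com"):
    print(host, "->", routing(host, always_direct, never_direct))
```

The intranet host is sent direct even though the all acl would also match it, because the never_direct rules are never consulted once always_direct has allowed the request.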

Access.log details

The native access.log has ten (10) fields. There is one entry here for each HTTP (client) request and each ICP query. HTTP requests are logged when the client socket is closed. A single dash (-) indicates unavailable data.

1. Timestamp
The time when the client socket is closed. The format is 'Unix time' (seconds since Jan 1, 1970) with millisecond resolution. This can be converted to a readable format with:
cat access.log | perl -nwe 's/^(\d+)/localtime($1)/e; print'

2. Elapsed Time
The elapsed time of the request, in milliseconds. This is time between the accept() and close() of the client socket.

3. Client Address
The IP address of the connecting client, or the FQDN if the 'log_fqdn' option is enabled in the config file.

4. Log Tag / HTTP Code
The Log Tag describes how the request was treated locally (hit, miss, etc). All the tags are described below. The HTTP code is the reply code taken from the first line of the HTTP reply header. Non-HTTP requests may have zero reply codes.

5. Size
The number of bytes written to the client.

6. Request Method
The HTTP request method, or ICP_QUERY for ICP requests.

7. URL
The requested URL.

8. Ident
If ident_lookup is on, this field may contain the username associated with the client connection as derived from the ident service.

9. Hierarchy Data / Hostname
A description of how and where the requested object was fetched.

10. Content Type
The Content-type field from the HTTP reply.
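
The ten fields listed above can be split apart with a short script. This is a sketch that assumes the fields are separated by whitespace and that the URL contains no spaces; the sample line is invented for the illustration and is not taken from a real log.

```python
from datetime import datetime, timezone

# Field names follow the numbered list above; field 4 holds "tag/code".
FIELDS = (
    "timestamp", "elapsed_ms", "client", "tag_code", "size",
    "method", "url", "ident", "hierarchy", "content_type",
)


def parse_line(line):
    """Split one native access.log line into a dictionary of its ten fields."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    entry = dict(zip(FIELDS, values))
    # Field 4 combines the log tag and the HTTP reply code.
    entry["log_tag"], entry["http_code"] = entry.pop("tag_code").split("/", 1)
    # Unix time with millisecond resolution, converted to a readable UTC time.
    entry["time"] = datetime.fromtimestamp(float(entry["timestamp"]), tz=timezone.utc)
    entry["elapsed_ms"] = int(entry["elapsed_ms"])
    entry["size"] = int(entry["size"])
    return entry


# Hypothetical sample line, made up for this example.
sample = ("1021449600.123 45 192.0.2.10 TCP_MISS/200 1024 GET "
          "http://www.example.com/ - DIRECT/www.example.com text/html")

entry = parse_line(sample)
for key, value in entry.items():
    print(f"{key:12} {value}")
```

The conversion in parse_line does the same job as the Perl one-liner given for the timestamp field, except that it reports the time in UTC rather than local time.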

Access Log Tag / HTTP Code

"TCP_" refers to requests on the HTTP port.

TCP_HIT
A valid copy of the requested object was in the cache.

TCP_MISS
The requested object was not in the cache.

TCP_REFRESH_HIT
The object was in the cache, but STALE. An If-Modified-Since request was made and a '304 Not Modified' reply was received.

TCP_REF_FAIL_HIT
The object was in the cache, but STALE. The request to validate the object failed, so the old (stale) object was returned.

TCP_REFRESH_MISS
The object was in the cache, but STALE. An If-Modified-Since request was made and the reply contained new content.

TCP_CLIENT_REFRESH
The client issued a request with the 'no-cache' pragma.

TCP_CLIENT_REFRESH_MISS
The client issued a "no-cache" pragma, or some analogous cache control command, along with the request, so the cache has to refetch the object from the origin server. This is typically caused by users pressing the reload button (it is also triggered by selecting a bookmark in some browser versions). In short, the browser forced the proxy to check for a new version.

TCP_IMS_HIT
The client issued an If-Modified-Since request and the object was in the cache and still fresh. TCP_HIT and TCP_IMS_HIT are both hits; the only difference is that in the TCP_IMS_HIT case the browser already had an up-to-date version, so there was no need to send the cached copy to the requestor.

TCP_IMS_MISS
The client issued an If-Modified-Since request for a stale object.

TCP_SWAPFAIL
The object was believed to be in the cache, but could not be accessed.

TCP_DENIED
Access was denied for this request.

"UDP_" refers to requests on the ICP port.

UDP_HIT
A valid copy of the requested object was in the cache.

UDP_HIT_OBJ
Same as UDP_HIT, but the object data was small enough to be sent in the UDP reply packet. Saves the following TCP request.

UDP_MISS
The requested object was not in the cache.

UDP_DENIED
Access was denied for this request.

UDP_INVALID
An invalid request was received.

UDP_RELOADING
The ICP request was "refused" because the cache is busy, reloading its metadata.

SIBLING_HIT
The object was fetched from a sibling cache which replied with UDP_HIT.

PARENT_HIT
The object was requested from a parent cache which replied with UDP_HIT.

DEFAULT_PARENT
No ICP queries were sent. This parent was chosen because it was marked ``default'' in the config file.

FIRST_UP_PARENT
The object was fetched from the first parent in the list of parents.

NO_PARENT_DIRECT
The object was fetched from the origin server, because no parents existed for the given URL.

FIRST_PARENT_MISS
The object was fetched from the parent with the fastest (possibly weighted) round trip time.

CLOSEST_PARENT_MISS
This parent was chosen because it included the lowest RTT measurement to the origin server. See also the closest-only peer configuration option.

CLOSEST_PARENT
The Parent selection was based on our own RTT measurements.

Refresh Pattern

Squid switched from a Time-To-Live based expiration model to a Refresh-Rate model. Objects are no longer purged from the cache when they expire. Instead of assigning TTL's when the object enters the cache, we now check freshness requirements when objects are requested. If an object is 'fresh' it is given directly to the client. If it is 'stale' then we make an If-Modified-Since request for it.

Terms in delay pool

Pool:
A collection of bucket groups as appropriate to a given class.

Bucket group:
A group of buckets within a pool, such as the per-host bucket group, the per-network bucket group or the aggregate bucket group (the aggregate bucket group is actually a single bucket).

Bucket:
An individual delay bucket represents a traffic allocation, which is replenished at a given rate (up to a given limit) and causes traffic to be delayed when empty.

Classes:
There are 3 classes of delay pools: class 1 is a single aggregate bucket, class 2 is an aggregate bucket with an individual bucket for each host in the class C, and class 3 is an aggregate bucket with a network bucket (for each class B) and an individual bucket for each host.

class:-
The class of a delay pool determines how the delay is applied, i.e., whether the different client IPs are treated separately or as a group (or both).

class1:-
A class 1 delay pool contains a single unified bucket, which is used for all requests from hosts subject to the pool.

class2:-
A class 2 delay pool contains one unified bucket and 255 buckets, one for each host on an 8-bit network.

class3:-
A class 3 delay pool contains 255 buckets for the subnets in a 16-bit network, and individual buckets for every host on these networks (IPv4 class B).

Setting the parameters for each pool is done by:

delay_parameters pool aggregate network individual

Here pool is a pool number, i.e., a number between 1 and the number specified in delay_pools, as used in delay_class lines; aggregate is the parameter for the aggregate bucket, network for the network bucket, and individual for the individual bucket. Aggregate applies to classes 1, 2 and 3, individual to classes 2 and 3, and network to class 3 only. Each of these parameters is specified as restore/maximum, restore being the number of bytes per second restored to the bucket and maximum being the number of bytes that can be in the bucket at any time. It is important to remember that they are in bytes per second, not bits. To specify that a parameter is unlimited, use -1.

If we wish to express a limit given in bits per second, divide the amount by 8, and use the resulting value for both the restore and the maximum. For example, to restrict the entire proxy to 64 kbps, use:

delay_parameters 1 8000/8000
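
The restore/maximum behaviour of a bucket, and the bits-to-bytes conversion, can be sketched as follows. This is a toy model for illustration, not Squid's implementation; the class and function names are invented, and the bucket is assumed to start full.

```python
def bits_to_parameter(bits_per_second):
    """Build a restore/maximum value, in bytes, from a rate in bits per second."""
    rate = bits_per_second // 8
    return f"{rate}/{rate}"


class Bucket:
    """A delay bucket refilled at `restore` bytes per second up to `maximum` bytes."""

    def __init__(self, restore, maximum):
        self.restore = restore
        self.maximum = maximum
        self.level = maximum  # assumption of this sketch: the bucket starts full

    def tick(self, seconds=1):
        """Refill the bucket for the given number of seconds, capped at maximum."""
        self.level = min(self.maximum, self.level + self.restore * seconds)

    def take(self, wanted):
        """Hand out up to `wanted` bytes; the rest has to wait for a refill."""
        granted = min(wanted, self.level)
        self.level -= granted
        return granted


print("delay_parameters 1", bits_to_parameter(64000))

bucket = Bucket(restore=8000, maximum=8000)
first = bucket.take(10000)   # only 8000 bytes are available
second = bucket.take(1)      # bucket is empty, so traffic is delayed
bucket.tick()                # one second later the bucket holds 8000 bytes again
third = bucket.take(5000)
print(first, second, third, bucket.level)
```

An empty bucket grants nothing until the next refill, which is how the delay arises; a maximum larger than restore would allow short bursts above the sustained rate.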

FTP Login Information

Squid can act as a proxy server for various Internet protocols. The most commonly used protocol is HTTP, but the File Transfer Protocol (FTP) is still alive and well.

FTP was written for authenticated file transfer (it requires a username and password). To provide public access, a special account is created: the anonymous user. When you log into an FTP server you use this as your username. As a password, you generally use your email address. Most browsers these days automatically enter a useless email address.

It's polite to give an address that works, though. If one of your users abuses a site, it allows the site admin to get hold of you easily.

Squid allows you to set the email address that is used with the ftp_user tag. You should probably create a squid@yourdomain.example email address specifically for people to contact you on.

There is another reason to enter a proper address here: some servers require a real email address. For your proxy to log into these FTP servers, you will have to enter a real email address here.

Effective User and Group ID

Squid can only bind to low numbered ports (such as port 80) if it is started as root. Squid is normally started by your system's rc scripts when the machine boots. Since these scripts run as root, Squid is started as root at bootup time.

Once Squid has been started, however, there is no need to run it as root. Good security practice is to run programs as root only when it's absolutely necessary, and for this reason Squid changes user and group ID's once it has bound to the incoming network port.

The cache_effective_user and cache_effective_group tags tell Squid what IDs to change to. The Unix security system would be useless if it allowed all users to change their IDs at will, so Squid only attempts to change IDs if the main program is started as root.

If you do not have root access to the machine, and are thus not starting Squid as root, you can simply leave this option commented out. Squid will then run with whatever user ID starts the actual Squid binary.

Now let us assume that you have created both a squid user and a squid group on your cache machine. The above tags should thus both be set to 'squid'.

Timeouts

Half closed clients:
Clients that shut down the sending side of their TCP connection while leaving the receiving side open are termed half-closed clients, i.e., the client has finished sending but can still receive data.

Fully closed clients:
A fully closed client has completed its exchange of requests, responses and acknowledgements with the server, and the connection is closed in both directions.

Persistent Connection:
The persistent connection (keep-alive) feature allows the same connection to remain open for multiple requests. The drawback is that processing of the next request cannot start before the previous response has been sent by the server.

IDENT:
Squid will make an RFC931/ident request for client connections if 'ident_lookup' is enabled in the config file. Currently, the ident value is only logged with the request in the access.log. It is not currently possible to use the ident return value for access control purposes.

URN:
The URI architecture requires that a resource be named by a URN (Uniform Resource Name) and be retrieved by a URL (Uniform Resource Locator). A URC (Uniform Resource Characteristic) binds the URN of a resource to one or more URLs. Once this system is activated, URNs will be used to "reference" information resources. World Wide Web clients will then send the URN for a desired resource to an international network of URN-to-URL resolvers (the URC service) that will return to the client one or more URLs that can be used to access the resource.

SIGHUP or SIGTERM:
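System signals sent to processes running on Linux and other Unix systems. SIGTERM asks a process to shut down; SIGHUP is conventionally used to make a daemon such as Squid re-read its configuration.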

External Programs

Htpasswd:
This is the Apache password-file utility. You can use it to create password files for Squid as well.

The syntax is:

htpasswd [ -c ] passwdfile username

Redirector:
Squid now has the ability to rewrite requested URLs. Implemented as an external process (similar to a dns server), Squid can be configured to pass every incoming URL through a 'redirector' process that returns either a new URL, or a blank line to indicate no change.

The redirector program is NOT a standard part of the Squid package. However there are a couple of user-contributed redirectors in the "contrib/" directory. Since everyone has different needs, it is up to the individual administrators to write their own implementation. For testing, and a place to start, this very simple Perl script can be used:

#!/usr/local/bin/perl
$|=1;
print while (<>);

The redirector program must read URLs (one per line) on standard input, and write rewritten URLs or blank lines on standard output. Note that the redirector program cannot use buffered I/O.
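
The read-a-line, write-a-line contract described above might look like this in Python. It is a sketch only: the rewrite rule (old.example.com to new.example.com) is a hypothetical example, and taking the first whitespace-separated token of each line as the URL is an assumption of the sketch.

```python
import io

# Hypothetical rewrite rule, for illustration only.
OLD_PREFIX = "http://old.example.com/"
NEW_PREFIX = "http://new.example.com/"


def rewrite(line):
    """Return the rewritten URL for one input line, or '' for no change."""
    fields = line.split()
    if fields and fields[0].startswith(OLD_PREFIX):
        return NEW_PREFIX + fields[0][len(OLD_PREFIX):]
    return ""


def serve(stdin, stdout):
    """Answer every input line with exactly one output line."""
    for line in stdin:
        stdout.write(rewrite(line) + "\n")
        stdout.flush()  # each reply must leave immediately, not sit in a buffer


# Demonstration with in-memory streams standing in for the real pipes.
demo_in = io.StringIO("http://old.example.com/a.html\nhttp://www.example.org/\n")
demo_out = io.StringIO()
serve(demo_in, demo_out)
print(repr(demo_out.getvalue()))
```

A real redirector would call serve(sys.stdin, sys.stdout) instead of the in-memory demonstration. The flush after every line plays the same role as $|=1 in the Perl script.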

FTP Passive Connections:
FTP uses two data streams, one for passing commands around, the other for moving data. The command channel is handled by the ftpd listening on port 21.

The data channel varies depending on whether you ask for passive FTP or not. When you request data in a non-passive environment, your client tells the server ``I am listening on <IP address> <port>.'' The server then connects FROM port 20 to the IP address and port specified by your client. This requires your "security device" to permit any host outside, from port 20, to reach any host inside on any port >1023. Somewhat of a hole.

In passive mode, when you request a data transfer, the server tells the client ``I am listening on <IP address> <port>.'' Your client then connects to the server on that IP address and port, and data flows.

Unlinkd Program:
Unlinkd is an external process used for unlinking old files in the cache to make room for newer objects.

Pinger Process:
The Squid ping program is an external program that provides Squid with ICMP RTT information, so that it can choose more effectively between multiple remote parent caches for request fulfillment. It is only needed in special cases, and your Squid must have been compiled with the --enable-icmp configure option for it to work. It should only be used on caches that must choose between multiple parent caches on different networks. The default program for this task is called pinger; its location is set with the pinger_program directive.

Byte-hit ratio

The byte-hit ratio measures the ratio of total bytes from cached objects over the total bytes of objects requested.
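
As a worked example, the ratio can be computed from a list of request sizes. This is a minimal sketch; the function name and the sample figures are invented for the illustration.

```python
def byte_hit_ratio(requests):
    """Return bytes served from cache divided by total bytes requested.

    requests is a list of (size_in_bytes, served_from_cache) pairs.
    """
    total = sum(size for size, _ in requests)
    if total == 0:
        return 0.0
    hit_bytes = sum(size for size, hit in requests if hit)
    return hit_bytes / total


# Hypothetical sample: two small hits and two large misses.
sample = [(1000, True), (3000, False), (1000, True), (5000, False)]

ratio = byte_hit_ratio(sample)
request_ratio = sum(1 for _, hit in sample if hit) / len(sample)
print(f"byte-hit ratio:    {ratio:.0%}")
print(f"request hit ratio: {request_ratio:.0%}")
```

In this sample half of the requests are hits, but they account for only 2000 of the 10000 bytes, so the byte-hit ratio is well below the request hit ratio.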

All rights reserved.
All trademarks used in this document are owned by their respective companies. This document makes no ownership claim of any trademark(s). If you wish to have your trademark removed from this document, please contact the copyright holder. No disrespect is meant by any use of other companies’ trademarks in this document.
Note: The pages on this website cannot be duplicated on to another site. Copying and usage of the contents for personal and corporate purposes is acceptable. In near future, it will be released under the GNU Free Documentation License.

© ViSolve.com 2002
Created By: squid@visolve.com	Date: May 15,2002
Revision No:0.0
Modified By	Date