
Commit 9bd6748

v3.0.0
1 parent 0668baf commit 9bd6748

File tree

1 file changed: +188 -0 lines changed


src/User_Manual/vision.html

Lines changed: 188 additions & 0 deletions
@@ -0,0 +1,188 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Vision Models</title>
<style>
body {
    font-family: Arial, sans-serif;
    line-height: 1.6;
    margin: 0;
    padding: 0;
    background-color: #161b22;
    color: #d0d0d0;
}

header {
    text-align: center;
    background-color: #3498db;
    color: #fff;
    padding: 20px;
    position: sticky;
    top: 0;
    z-index: 999;
}

main {
    max-width: 800px;
    margin: 0 auto;
    padding: 20px;
}

img {
    display: block;
    margin: 0 auto;
    max-width: 100%;
    height: auto;
}

h1 {
    color: #333;
}

h2 {
    color: #f0f0f0;
    text-align: center;
}

p {
    text-indent: 35px;
}

table {
    border-collapse: collapse;
    width: 80%;
    margin: 50px auto;
}

th, td {
    text-align: left;
    padding: 8px;
    border-bottom: 1px solid #ddd;
}

th {
    background-color: #f2f2f2;
    color: #000;
}

footer {
    text-align: center;
    background-color: #333;
    color: #fff;
    padding: 10px;
}

code {
    background-color: #f9f9f9;
    border-radius: 3px;
    padding: 2px 3px;
    font-family: "SFMono-Regular", Consolas, "Liberation Mono", Menlo, monospace;
    color: #333;
}
</style>

</head>
<body>
<header>
<h1>Vision</h1>
</header>

<main>

<section>

<h2 style="color: #f0f0f0;" align="center">What are Vision Models?</h2>

<p>Vision models are large language models that can also analyze images and extract information from them. In this
program, a vision model is used to generate a summary of what each image depicts, and that description is added to the
vector database, where it can be searched alongside any traditional documents you add!</p>

<h2 style="color: #f0f0f0;" align="center">Which Vision Models Are Available?</h2>

<p>There are three named vision models available with this program:</p>

<ol>
<li>llava</li>
<li>bakllava</li>
<li>cogvlm</li>
</ol>

<p><code>llava</code> models were trailblazers, and this program uses both the 7b and 13b sizes. <code>llava</code>
models are based on the <code>llama2</code> architecture. <code>bakllava</code> is similar to <code>llava</code>
except that its architecture is based on <code>mistral</code>, and it only comes in the 7b variety.
<code>cogvlm</code> has <u>18b parameters</u>, but it is my personal favorite because it produces the best results by
far. In my experience, over 90% of the statements in its summaries are accurate, whereas <code>bakllava</code> is only
about 70% accurate and <code>llava</code> is slightly lower than that (regardless of whether you use the 7b or 13b
size).</p>

<h2 style="color: #f0f0f0;" align="center">What do the Settings Mean?</h2>

<p><code>Model</code> is simply the model's name. Note that you cannot use <code>cogvlm</code> on macOS because it
requires the <code>xformers</code> library, which does not currently provide a build for macOS.</p>

<p><code>Size</code> refers to the number of parameters (in billions). Larger generally means better, but unlike with
typical large language models, I didn't notice a difference between the <code>llava</code> 7b and 13b sizes; feel free
to experiment. The Tools Tab contains a table outlining the general VRAM requirements for the various models/settings.
Remember, this is <b><u>before</u></b> accounting for overhead such as your monitor, which typically amounts to
<code>1-2 GB</code> more.</p>

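<p>As a rough rule of thumb (an approximation of my own, not the table in the Tools Tab), the weights of a 4-bit
quantized model take about half a byte per parameter, so you can ballpark the VRAM needed before overhead like this:</p>

<pre><code>
# Rough back-of-the-envelope VRAM estimate. This is an assumption-laden sketch,
# not the official table from the Tools Tab; actual usage also depends on
# activations, image resolution, and framework overhead.
def estimate_vram_gb(params_billion, bits_per_param=4, overhead_gb=1.5):
    weights_gb = params_billion * (bits_per_param / 8)  # GB of weights
    return weights_gb + overhead_gb                     # add display/CUDA overhead

print(round(estimate_vram_gb(7), 1))   # llava/bakllava 7b, 4-bit  -> ~5.0 GB
print(round(estimate_vram_gb(18), 1))  # cogvlm 18b, 4-bit         -> ~10.5 GB
</code></pre>
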
<p><code>Quant</code> refers to the quantization of the model - i.e., how much it is reduced from its original
floating-point format. See the tail end of the Whisper portion of the User Guide for a primer on floating-point
formats. This program uses the <code>bitsandbytes</code> library to perform the quantizations because it's the only
option I was aware of that could quantize <code>cogvlm</code>, which is far superior IMHO.</p>

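<p>For anyone curious what that looks like in code, below is a minimal sketch of how a <code>bitsandbytes</code>
quantization is typically configured through the <code>transformers</code> library. The model ID and compute dtype
shown are illustrative assumptions, not necessarily what this program uses internally.</p>

<pre><code>
# Minimal sketch: 4-bit quantization via bitsandbytes and transformers.
# The model ID and dtype below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize weights to 4-bit
    bnb_4bit_compute_dtype=torch.float16,  # dtype used during computation
)

model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm-chat-hf",                # illustrative model ID
    quantization_config=bnb_config,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
)
</code></pre>
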
<h2 style="color: #f0f0f0;" align="center">Why Are Some Settings Disabled?</h2>

<p><code>Flash Attention 2</code> is a very powerful newer technology, but it requires <code>CUDA 12+</code>. This
program relies exclusively on <code>CUDA 11</code> for compatibility with the <code>faster-whisper</code> library that
handles the audio features. However, <code>faster-whisper</code> should be adding <code>CUDA 12+</code> support in the
near future, at which time <code>Flash Attention 2</code> should become available. <code>Batch</code> will be explained
and added in a future release.</p>

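<p>If you are curious whether your own PyTorch build is compiled against CUDA 11 or CUDA 12, a quick check looks like
the snippet below. This is a general PyTorch snippet, not part of this program.</p>

<pre><code>
# Check which CUDA toolkit your PyTorch build was compiled against.
# Per the note above, Flash Attention 2 would need a CUDA 12+ build,
# while this program currently ships against CUDA 11.
import torch

print(torch.version.cuda)          # e.g. "11.8" or "12.1"
print(torch.cuda.is_available())   # True if a CUDA GPU is usable
</code></pre>
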
<h2 style="color: #f0f0f0;" align="center">How do I use the Vision Model?</h2>

<p>Before <code>Release 3</code>, this program put all selected documents within the "Docs_for_DB" folder. Now it also
puts any selected images in the "Images_for_DB" folder. You can manually remove images from there if need be. Once
documents and/or images are selected, you simply click the <code>create database</code> button like before. The
document processor runs in two steps: first it loads the non-image documents, and then it loads any images.</p>

<p>The "loading" process takes very little time for documents but a relatively long time for images. "Loading" images involves
151+
creating the summaries for each image using the selected vision model. Make sure and test your vision model settings within
152+
the Tools Tab before committing to processing, for example, 100 images.</p>
153+
154+
<p>After both documents and images are "loaded," they are added to the vectorstore just the same as in prior releases
of this program.</p>

<p>Once the database is "persisted," try searching for images that depict a certain thing. You can also check the
<code>chunks only</code> checkbox to see the results returned from the database instead of connecting to LM Studio.
This is extremely useful for fine-tuning your settings, including both the chunking/overlap settings and the vision
model settings.</p>

<p>PRO TIP: Make sure to set your chunk size to something larger than the summaries provided by the vision model.
Doing this prevents the summary for a particular image from EVER being split; each and every chunk then consists of the
<u>entire summary</u> provided by the vision model! This tends to mean a chunk size of 400-800, depending on the vision
model settings.</p>

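<p>Below is a minimal sketch of that rule of thumb. The splitter class shown is an assumption about the underlying
tooling rather than this program's actual code; the only point is that the chunk size should exceed the longest image
summary so a summary is never split.</p>

<pre><code>
# Minimal sketch of the chunk-size rule of thumb. RecursiveCharacterTextSplitter
# is shown as an assumption about the underlying tooling; the point is only that
# chunk_size should exceed the longest image summary so no summary is ever split.
from langchain.text_splitter import RecursiveCharacterTextSplitter

summaries = ["...image summary one...", "...image summary two..."]  # placeholder text
longest = max(len(s) for s in summaries)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=max(800, longest + 100),  # comfortably larger than any summary
    chunk_overlap=0,
)

chunks = splitter.split_text(summaries[0])
assert len(chunks) == 1  # the whole summary stays in a single chunk
</code></pre>
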
<h2 style="color: #f0f0f0;" align="center">Can I Change What the Vision Model Does?</h2>

<p>For this initial release, I hardcoded the questions asked of the vision models within the following scripts:</p>

<ol>
<li><code>vision_cogvlm_module.py</code></li>
<li><code>vision_llava_module.py</code></li>
<li><code>loader_vision_cogvlm.py</code></li>
<li><code>loader_vision_llava.py</code></li>
</ol>

<p>You can go into these scripts and modify the question sent to the vision model, but make sure the prompt format
remains the same. In future releases I will likely add the ability to experiment with different questions within the
graphical user interface to achieve better results.</p>

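<p>As a purely hypothetical illustration of what that means: change only the question text and leave the surrounding
prompt template untouched. The variable names and the exact template inside the real scripts may differ.</p>

<pre><code>
# Purely hypothetical sketch: edit only the question, keep the template intact.
# The variable names and exact template in the actual scripts may differ.
QUESTION = "Describe this image in as much detail as possible."  # edit this line

def build_llava_prompt(question=QUESTION):
    # llava-style chat template; everything except the question stays unchanged
    return f"USER: &lt;image&gt;\n{question} ASSISTANT:"
</code></pre>
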
</section>

</main>

<footer>
www.chintellalaw.com
</footer>
</body>
</html>
