MSFN Forum: How to check text file encoding from command-line? - MSFN Forum


#1 tomasz86

  • http://www.windows2000.tk
  • Group: Members
  • Posts: 2,220
  • Joined: 27-November 10
  • OS: Windows 2000 Professional

Posted 07 October 2012 - 07:37 AM

Is there any simple way to check text file encoding from command-line?

I'm working on a script which adds lines to the I386\HIVE*.INF files in Windows 2000/XP source. In an English system those HIVE*.INF files are coded in ANSI but in a Korean Windows XP they're coded in UCS-2 Little Endian.

I don't really need to know the specific encoding. I'd just like to know whether the file is ANSI or not.


#2 allen2

  • Not really Newbie
  • Group: Members
  • Posts: 1,733
  • Joined: 13-January 06

Posted 07 October 2012 - 09:34 AM

You could use file -i from the GnuWin32 build of the file utility.

#3 jaclaz

  • The Finder
  • Group: Developers
  • Posts: 11,419
  • Joined: 23-July 04
  • OS: none specified

Posted 07 October 2012 - 09:38 AM

tomasz86, on 07 October 2012 - 07:37 AM, said:

Is there any simple way to check text file encoding from command-line?

I'm working on a script which adds lines to the I386\HIVE*.INF files in Windows 2000/XP source. In an English system those HIVE*.INF files are coded in ANSI but in a Korean Windows XP they're coded in UCS-2 Little Endian.

I don't really need to know the specific encoding. I'd just like to know whether the file is ANSI or not.

UCS-2 Little Endian sounds to me a lot like "Unicode". :whistle:

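The simplest check you can make is looking for a hex 00 byte: if there is at least one, the file is almost certainly "Unicode" (UTF-16/UCS-2); conversely, if there are none, it's not UTF-16 and it is very likely to be "plain ANSI text" (though it could still be UTF-8, which contains no 00 bytes either).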
http://betterexplain...ticles/unicode/
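
A minimal sketch of that check, written in Python purely for illustration (it is not one of the built-in Windows tools the thread is after). The helper names and the 4096-byte sample size are assumptions, not something from the post, and the result is only a heuristic:

```python
def looks_like_utf16(data: bytes) -> bool:
    """Heuristic: ANSI text normally has no 0x00 bytes, UTF-16 text usually does."""
    return b"\x00" in data


def file_looks_like_utf16(path: str, sample_size: int = 4096) -> bool:
    # Only the beginning of the file is sampled; sample_size is an assumption.
    with open(path, "rb") as handle:
        return looks_like_utf16(handle.read(sample_size))


# Made-up sample data: the same line encoded as ANSI and as UCS-2/UTF-16 LE.
ansi_bytes = "[Version]\r\n".encode("cp1252")
utf16_bytes = "[Version]\r\n".encode("utf-16-le")
print(looks_like_utf16(ansi_bytes))
print(looks_like_utf16(utf16_bytes))
```

A UTF-16 file made up only of characters above U+00FF with no line breaks would slip through, which is one reason checking the BOM is the more robust test.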

jaclaz

#4 tain

  • Cyber Ops
  • Group: Super Moderator
  • Posts: 3,557
  • Joined: 24-September 05
  • OS: none specified

Posted 07 October 2012 - 09:39 AM

Nice one, jaclaz!

#5 tomasz86

Posted 07 October 2012 - 03:59 PM

Thanks for your suggestions!

@allen2

It would be nice to be able to do it without any external application although I'll have a look at it if I can't manage to do it with just the Windows default tools.


@jaclaz

Can you view a text file in hex in the command-line? After googling the only "method" which I've managed to find is to use "debug.exe" but it actually displays "00"s in ANSI files too.

I wonder what you think about this. If you ECHO something to a text file coded in UCS-2 Little Endian from CMD (without the /U switch) the text will be completely broken. I'm thinking about ECHOing a specific string to those HIVE*.INF files and then just search for it with FINDSTR. If it can't find it then it will mean that the file is UCS-2 Little Endian.

This post has been edited by tomasz86: 07 October 2012 - 04:00 PM


#6 jumper

  • Masters HJ/TJ'er (back in training)
  • Group: Members
  • Posts: 359
  • Joined: 21-January 11
  • OS: 98SE

Posted 08 October 2012 - 01:07 AM

Attached File: endian.zip (988 bytes) - endian.exe (console-32 app), endian.bat (sample usage)

endian.exe source snippet based on header info detailed at http://betterexplain...ticles/unicode/:
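  // read the first three (or more) bytes of the file into unsigned char s[], then:
  return
    (s[0]==0xFF) && (s[1]==0xFE) ? 255 :                  // FF FE: UCS-2 Little Endian BOM
    (s[0]==0xFE) && (s[1]==0xFF) ? 254 :                  // FE FF: UCS-2 Big Endian BOM
    (s[0]==0xEF) && (s[1]==0xBB) && (s[2]==0xBF) ? 239 :  // EF BB BF: UTF-8 BOM
    0;                                                    // no BOM: treated as ANSI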

Sample usage: endian.bat
@echo off
%0\..\endian %1

IF ERRORLEVEL 255 GOTO UCS2LE
IF ERRORLEVEL 254 GOTO UCS2BE
IF ERRORLEVEL 239 GOTO UTF8

echo ANSI
GOTO End

:UCS2LE
echo UCS-2 Little Endian
GOTO End

:UCS2BE
echo UCS-2 Big Endian
GOTO End

:UTF8
echo UTF-8

:End



If debug.exe is available, I think it can be used to achieve the same results. Debug can be scripted to open the file, analyze the first three bytes, and create a temp com file that sets an appropriate ERRORLEVEL. If Debug doesn't set the ERRORLEVEL upon exit itself, the temp file can be eliminated by running the temp program within Debug.

#7 jaclaz

Posted 08 October 2012 - 02:18 AM

tomasz86, on 07 October 2012 - 03:59 PM, said:

@jaclaz

Can you view a text file in hex in the command-line? After googling the only "method" which I've managed to find is to use "debug.exe" but it actually displays "00"s in ANSI files too.

I wonder what you think about this. If you ECHO something to a text file coded in UCS-2 Little Endian from CMD (without the /U switch) the text will be completely broken. I'm thinking about ECHOing a specific string to those HIVE*.INF files and then just search for it with FINDSTR. If it can't find it then it will mean that the file is UCS-2 Little Endian.

Well, you have gsar already used in your "standard" set of tools, haven't you?

I doubt - no offence intended :) - that an ANSI file contains hex 00, but you can (better :thumbup:) use gsar to find the initial FFFE, as jumper suggested.
In any case, FOR /F won't "like" Unicode, thus:


@ECHO OFF
::ISANSI.CMD - small example batch to check if a file is ANSI or UNICODE
FOR /F %%A in  (%1) DO ECHO ANSI&GOTO :EOF
ECHO UNICODE

BUT this won't work the same, because TYPE converts the Unicode file to the console code page before FOR /F sees its output:
@ECHO OFF
::NOANSI.CMD - small example batch that will always return ANSI
FOR /F %%A in  ('TYPE %1') DO ECHO ANSI&GOTO :EOF
ECHO UNICODE


jaclaz

#8 tomasz86

Posted 08 October 2012 - 02:25 AM

Well, in this particular script the only "non-Windows" tool is gsar.exe...

But for the moment I have actually managed to solve this specific problem with this very simple check:

SETLOCAL ENABLEDELAYEDEXPANSION
FINDSTR/IL "[Version]" I386\hivedef.inf >NUL
IF !ERRORLEVEL! EQU 0 (
	TYPE temp.txt>>I386\hivedef.inf
) ELSE (
	CMD /U /C "TYPE temp.txt>>I386\hivedef.inf"
)

FINDSTR can't find "[Version]" if the file is encoded in UCS-2 Little Endian.
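
A small sketch of why a byte-oriented search misses the string, assuming FINDSTR compares the ANSI bytes of the search string against the raw bytes of the file (an assumption about its behaviour, not something confirmed in this thread). Python is used only for illustration and the sample file contents are made up:

```python
import codecs

# The search string as a byte-oriented tool would see it (ANSI bytes).
needle = "[Version]".encode("cp1252")

# Made-up file contents, once as ANSI and once as UCS-2 LE with a BOM.
sample_text = "[Version]\r\n[Strings]\r\n"
ansi_file = sample_text.encode("cp1252")
ucs2_file = codecs.BOM_UTF16_LE + sample_text.encode("utf-16-le")

# In UCS-2 LE every ASCII character is followed by a 00 byte, so the
# contiguous ANSI byte sequence never appears in the file.
found_in_ansi = needle in ansi_file
found_in_ucs2 = needle in ucs2_file
print(found_in_ansi, found_in_ucs2)
```

The same reasoning suggests the test reports "not ANSI" for any file that simply lacks a [Version] line, so it only suits files known to contain one.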

This post has been edited by tomasz86: 08 October 2012 - 02:26 AM


#9 allen2

Posted 08 October 2012 - 02:32 AM

The BOM (if present) might help to detect the encoding of the file.
And this vbs will report the encoding of the file passed as an argument:
Function encoding(fpn)
' Read the file as raw bytes and compare the first four against known BOMs.
' Assumes the file is at least 4 bytes long.
Dim file, data, a, b, c, d
Set file = CreateObject("ADODB.Stream")
file.Type = 1 ' 1 = adTypeBinary
file.Open
file.LoadFromFile fpn
data = file.Read
file.Close
' Hex() drops leading zeros, so pad every byte to two digits ("0" becomes "00")
a = Right("0" & Hex(AscB(MidB(data, 1, 1))), 2)
b = Right("0" & Hex(AscB(MidB(data, 2, 1))), 2)
c = Right("0" & Hex(AscB(MidB(data, 3, 1))), 2)
d = Right("0" & Hex(AscB(MidB(data, 4, 1))), 2)
encoding = "unknown ascii"
If a = "EF" And b = "BB" And c = "BF" Then encoding = "UTF-8"
' No test on the third byte here: in UTF-16 (BE) it is 00 whenever the first character is ASCII
If a = "FE" And b = "FF" Then encoding = "UTF-16 (BE)"
If a = "FF" And b = "FE" Then encoding = "UTF-16 (LE)"
If a = "00" And b = "00" And c = "FE" And d = "FF" Then encoding = "UTF-32 (BE)"
' Checked after UTF-16 (LE) so that the longer FF FE 00 00 signature wins
If a = "FF" And b = "FE" And c = "00" And d = "00" Then encoding = "UTF-32 (LE)"
If a = "2B" And b = "2F" And c = "76" And (d = "38" Or d = "39" Or d = "2B" Or d = "2F") Then encoding = "UTF-7"
If a = "F7" And b = "64" And c = "4C" Then encoding = "UTF-1"
If a = "DD" And b = "73" And c = "66" And d = "73" Then encoding = "UTF-EBCDIC"
If a = "0E" And b = "FE" And c = "FF" Then encoding = "SCSU"
If a = "FB" And b = "EE" And c = "28" Then encoding = "BOCU-1"
If a = "84" And b = "31" And c = "95" And d = "33" Then encoding = "GB-18030"
End Function
WScript.Echo encoding(WScript.Arguments.Item(0))



But many files don't contain a BOM.
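
For files without a BOM, one possible fallback is to look at where the 00 bytes fall. The following is only a sketch under stated assumptions, in Python for illustration: the function name, the 4096-byte sample and the wording of the results are invented here, and the fallback is a guess that works for mostly-ASCII text such as INF files.

```python
import codecs

# Longest signatures first, so UTF-32 LE (FF FE 00 00) is not
# mistaken for UTF-16 LE (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 (LE)"),
    (codecs.BOM_UTF32_BE, "UTF-32 (BE)"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 (LE)"),
    (codecs.BOM_UTF16_BE, "UTF-16 (BE)"),
]


def guess_encoding(data: bytes) -> str:
    """Return an encoding name from the BOM, or a labelled guess without one."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # No BOM: count 00 bytes at even and odd offsets of a sample.
    sample = data[:4096]
    even_nuls = sample[0::2].count(0)
    odd_nuls = sample[1::2].count(0)
    if odd_nuls > even_nuls:
        # ASCII characters in little-endian order: character byte, then 00.
        return "UTF-16 (LE), no BOM (guess)"
    if even_nuls > odd_nuls:
        # ASCII characters in big-endian order: 00, then character byte.
        return "UTF-16 (BE), no BOM (guess)"
    return "ANSI or other 8-bit encoding (guess)"


print(guess_encoding(codecs.BOM_UTF16_LE + "[Version]".encode("utf-16-le")))
print(guess_encoding("[Version]".encode("utf-16-le")))
print(guess_encoding(b"[Version]"))
```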

#10 jaclaz

Posted 08 October 2012 - 03:26 AM

It is clear how we all give a different meaning to the word "simple". :whistle:
FOR /F %%A in  (I386\hivedef.inf) DO TYPE temp.txt>>I386\hivedef.inf&GOTO :skip
CMD /U /C "TYPE temp.txt>>I386\hivedef.inf"
:skip
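
The same idea (detect first, then append in the matching encoding) can be sketched outside batch as well. This is only an illustration in Python under stated assumptions: the function names are invented, cp1252 stands in for "ANSI", and detection relies on the FF FE BOM rather than on FOR /F behaviour.

```python
import codecs
import os
import tempfile


def append_matching_encoding(path: str, text: str, ansi_codepage: str = "cp1252") -> None:
    """Append text as UTF-16 LE if the file starts with FF FE, else as ANSI."""
    with open(path, "rb") as handle:
        head = handle.read(2)
    if head == codecs.BOM_UTF16_LE:
        payload = text.encode("utf-16-le")
    else:
        payload = text.encode(ansi_codepage)
    with open(path, "ab") as handle:
        handle.write(payload)


def demo(initial: bytes, addition: str) -> bytes:
    """Write initial bytes to a temporary file, append to it, return the result."""
    fd, path = tempfile.mkstemp(suffix=".inf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(initial)
        append_matching_encoding(path, addition)
        with open(path, "rb") as handle:
            return handle.read()
    finally:
        os.remove(path)


# Made-up contents standing in for a Unicode and an ANSI hivedef.inf.
unicode_result = demo(codecs.BOM_UTF16_LE + "[Version]\r\n".encode("utf-16-le"), "[AddReg]\r\n")
ansi_result = demo(b"[Version]\r\n", "[AddReg]\r\n")
print(unicode_result.decode("utf-16"))
print(ansi_result.decode("cp1252"))
```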


jaclaz

#11 Yzöwl

  • Wise Owl
  • Group: Super Moderator
  • Posts: 4,363
  • Joined: 13-October 04
  • OS: Windows 7 x64

Posted 08 October 2012 - 09:13 AM

As some system files are encoded in ANSI, is it necessary to append using the same encoding? What happens if you convert the file to ANSI and use / append to it as such?

#12 tomasz86

Posted 09 October 2012 - 12:58 AM

jaclaz, on 08 October 2012 - 03:26 AM, said:

It is clear how we all give a different meaning to the word "simple". :whistle:
FOR /F %%A in  (I386\hivedef.inf) DO TYPE temp.txt>>I386\hivedef.inf&GOTO :skip
CMD /U /C "TYPE temp.txt>>I386\hivedef.inf"
:skip


Thank you :w00t: I meant simpler when compared to the other options where other tools must be used.


Yzöwl, on 08 October 2012 - 09:13 AM, said:

As some system files are encoded in ANSI, is it necessary to append using the same encoding? What happens if you convert the file to ANSI and use / append to it as such?

It should be possible in the case of these HIVE*.INF files, because they use only English and Korean characters, but what about Unicode files where a lot of different languages are used, like this one?

Attached File: Netrtle.7z (32.31K)

If you check the [Strings.*] sections you'll see that there are a lot of them for several different languages. If I try to convert such a file to ANSI then many characters are lost.
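
To illustrate the loss, here is a minimal sketch (Python, for illustration only) that lists the characters a given ANSI code page cannot represent. The function name and the sample strings are assumptions; they are not taken from the attached file.

```python
def lost_characters(text: str, codepage: str) -> set:
    """Return the characters of text that cannot be encoded in codepage."""
    lost = set()
    for ch in text:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            lost.add(ch)
    return lost


korean = "한국어"
# The Korean ANSI code page (cp949) can hold the Korean string...
print(lost_characters(korean, "cp949"))
# ...but the Western European one (cp1252) cannot hold any of it.
print(lost_characters(korean, "cp1252"))
```

A file whose [Strings.*] sections mix several scripts has no single ANSI code page for which this set is empty, which is why such files have to stay in Unicode.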

#13 allen2

Posted 09 October 2012 - 02:02 AM

Completely off topic, but the file netrtle.inf you attached contains an "r" at the beginning of the first line which shouldn't be there, no matter the encoding or the inf file.

#14 tomasz86

Posted 09 October 2012 - 02:41 AM

allen2, on 09 October 2012 - 02:02 AM, said:

Completely off topic, but the file netrtle.inf you attached contains an "r" at the beginning of the first line which shouldn't be there, no matter the encoding or the inf file.

It's just a typo of mine :blushing: You can safely ignore it.

#15 jumper

Posted 09 October 2012 - 09:28 PM

For those with pre-ME systems that don't support FINDSTR, here's a batch file that uses FIND to process one file from the command line or a whole directory of files:
@echo off
if not %1*==* goto TEST

for %%s in (*.inf) do call %0 %%s
goto EXIT

:TEST
find>nul /i "[Version]" %1
IF ERRORLEVEL 1 goto NONANSI

:ANSI
echo %1 is ANSI
goto EXIT

:NONANSI
echo %1 is non-ANSI

:EXIT

